Can recursive union find be optimized?

When implementing union-find, I would usually write the find function with path compression like this:
def find(x):
    if x != par[x]:
        par[x] = find(par[x])
    return par[x]
This is easy to remember and arguably easy to read. This is also how many books and websites describe the algorithm.
However, naively compiled, this would use stack memory linear in the input size. In many languages and systems that would by default result in a stack overflow.
The only non-recursive way I know of writing find is this:
def find(x):
    p = par[x]
    while p != par[p]:
        p = par[p]
    while x != p:
        x, par[x] = par[x], p
    return p
It seems unlikely that many compilers would find that. (Maybe Haskell would?)
My question is: in what cases is it safe to use the former version of find? If no widely used language can remove the recursion, shouldn't we tell people to use the iterative version? And might there be a simpler iterative implementation?
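As for a simpler iterative implementation: one candidate is path halving, which needs only a single loop. It compresses paths only partially, but it is a standard variant with the same amortized bounds when combined with union by rank. A minimal sketch (the explicit par parameter and the demo data are mine, not from the post):

```python
def find_halving(par, x):
    # Path halving: point every other node on the search path to its
    # grandparent, roughly halving the path length on each call.
    while par[x] != x:
        par[x] = par[par[x]]  # shortcut one level
        x = par[x]
    return x

# Tiny demo on a hand-built chain 4 -> 3 -> 2 -> 1 -> 0.
par = [0, 0, 1, 2, 3]
print(find_halving(par, 4))  # prints 0; par is now partially compressed
```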

There seem to be two separate questions here.
First - can optimizing compilers notice this and rewrite it? It's difficult to answer this question without testing all compilers and all versions. I tried this out using gcc 4.8.4 on the following code:
size_t find(size_t uf[], size_t index) {
    if (index != uf[index]) {
        uf[index] = find(uf, uf[index]);
    }
    return uf[index];
}

void link(size_t uf[], size_t i, size_t j) {
    uf[find(uf, i)] = uf[find(uf, j)];
}
This doesn't implement the union-by-rank optimization, but does support path compression. I compiled this using optimization level -O3 and the assembly is shown here:
find:
.LFB23:
.cfi_startproc
pushq %r14
.cfi_def_cfa_offset 16
.cfi_offset 14, -16
pushq %r13
.cfi_def_cfa_offset 24
.cfi_offset 13, -24
pushq %r12
.cfi_def_cfa_offset 32
.cfi_offset 12, -32
pushq %rbp
.cfi_def_cfa_offset 40
.cfi_offset 6, -40
pushq %rbx
.cfi_def_cfa_offset 48
.cfi_offset 3, -48
leaq (%rdi,%rsi,8), %rbx
movq (%rbx), %rax
cmpq %rsi, %rax
je .L2
leaq (%rdi,%rax,8), %rbp
movq 0(%rbp), %rdx
cmpq %rdx, %rax
je .L3
leaq (%rdi,%rdx,8), %r12
movq %rdx, %rax
movq (%r12), %rcx
cmpq %rcx, %rdx
je .L4
leaq (%rdi,%rcx,8), %r13
movq %rcx, %rax
movq 0(%r13), %rdx
cmpq %rdx, %rcx
je .L5
leaq (%rdi,%rdx,8), %r14
movq %rdx, %rax
movq (%r14), %rsi
cmpq %rsi, %rdx
je .L6
call find // <--- Recursion!
movq %rax, (%r14)
.L6:
movq %rax, 0(%r13)
.L5:
movq %rax, (%r12)
.L4:
movq %rax, 0(%rbp)
.L3:
movq %rax, (%rbx)
.L2:
popq %rbx
.cfi_def_cfa_offset 40
popq %rbp
.cfi_def_cfa_offset 32
popq %r12
.cfi_def_cfa_offset 24
popq %r13
.cfi_def_cfa_offset 16
popq %r14
.cfi_def_cfa_offset 8
ret
.cfi_endproc
Given the recursive call in the middle, the recursion was not eliminated; gcc inlined the first few levels of the call and then gave up and emitted a true recursive call. In fairness, the transformation you're describing is pretty nontrivial, so I'm not surprised it didn't find it. This doesn't mean that no optimizing compiler can find it, but it does mean that one major one won't.
Your second question is why we present the algorithm this way. As someone who teaches both algorithms and programming, I think it's extremely valuable to discuss algorithms using a presentation that's as simple as possible, even if it means abstracting away some particular implementation details. Here, the key idea behind the algorithm is to update the parent pointers of all the nodes encountered on the way up to the representative. Recursion happens to be a pretty clean way of describing that idea, even though, when implemented naively, it risks a stack overflow. By expressing the pseudocode in that particular way, it's easier to describe and discuss the algorithm and to prove that it works as advertised. We could describe it the other way to avoid a stack overflow, but in Theoryland we usually don't worry about details like that, and the iterative presentation, while more directly translatable into practice, would make it harder to see the key ideas.
When looking at pseudocode for more advanced algorithms and data structures, it's common to omit critically important implementation details and to handwave that certain tasks are possible to do in certain time bounds. When discussing algorithms or data structures that build on top of even more complex algorithms and data structures, it often becomes impossible to write out pseudocode for everything because you have layers on top of layers on top of layers of glossed-over details. From a theoretical perspective, this is fine - if the reader does want to implement it, they can fill in the blanks. On the other hand, if the reader is more interested in the key techniques from the paper and the theory (which in academic settings is common), they won't get bogged down in implementation details.


Assembly of variable-size stack frame: these stack-alignment instructions seem useless in allocating a VLA?

I'm reading Computer Systems: A Programmer's Perspective 3rd edition and the assembly in 3.10.5 Supporting Variable-Size Stack Frames, Figure 3.43 confuses me.
That part of the book is trying to explain how a variable-size stack frame is generated, and it gives C code and its assembly version as an example.
Here are the C code and the assembly (Figure 3.43 of the book):
I don't know what the use of lines 8-10 in the assembly is. Why not just use movq %rsp, %r8 after line 7?
(a) C code
long vframe(long n, long idx, long *q) {
    long i;
    long *p[n];
    p[0] = &i;
    for (i = 1; i < n; i++)
        p[i] = q;
    return *p[idx];
}
(b) Portions of generated assembly code
vframe:
2: pushq %rbp
3: movq %rsp, %rbp
4: subq $16, %rsp
5: leaq 22(, %rdi, 8), %rax
6: andq $-16, %rax
7: subq %rax, %rsp
8: leaq 7(%rsp), %rax
9: shrq $3, %rax
10: leaq 0(, %rax, 8), %r8
11: movq %r8, %rcx
................................
12: L3:
13: movq %rdx, (%rcx, %rax, 8)
14: addq $1, %rax
15: movq %rax, -8(%rbp)
16: L2:
17: movq -8(%rbp), %rax
18: cmpq %rdi, %rax
19: jl L3
20: leave
21: ret
Here is what I think:
After line 7, %rsp should be a multiple of 16, which means the low 4 bits of %rsp are all zeros. (Before vframe is called, %rsp is a multiple of 16 because of stack-frame alignment. The call pushes the 8-byte return address, the pushq in line 2 subtracts another 8, and line 4 subtracts 16, so %rsp is again a multiple of 16 at the start of line 7. Line 6 makes %rax a multiple of 16, so subtracting it in line 7 leaves %rsp a multiple of 16.)
Then in line 8, %rsp+7 is stored in %rax, and in line 9 %rax is shifted right logically by 3 bits, and in line 10, %rax*8 is stored in %r8.
After line 7, the lower 4 bits of %rsp are all zeros. Line 8's %rsp+7 just sets the lower 3 bits to ones, line 9 truncates those 3 bits, and line 10's %rax*8 shifts the result back left by 3 bits. So the final result should just be the original %rsp (the result of line 7).
So I wonder whether lines 8-10 are useless.
Why not just use movq %rsp, %r8 after line 7 and remove the original lines 8-10?
I thought that a useful exploratory program would be to reduce your generated code to:
.globl _vframe
_vframe:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 22(, %rdi, 8), %rax
andq $-16, %rax
subq %rax, %rsp
leaq 7(%rsp), %rax
shrq $3, %rax
leaq 0(, %rax, 8), %r8
mov %r8, %rax
sub %rsp, %rax
leave
ret
Note that I simply eliminated the code that did anything useful and instead returned the difference between %r8 and %rsp.
Then wrote a driver:
extern void *vframe(unsigned long n);
#include <stdio.h>

int main(void) {
    int i;
    for (i = 0; i < (1<<18); i++) {
        void *p = vframe(i);
        if (p) {
            printf("%d %p\n", i, p);
        }
    }
    return 0;
}
to check it out. They were always the same. So, why? It may be that this is a standard code-emission pattern when the compiler is confronted with a given construct (a variable-length array). The compiler has to maintain certain invariants, such as traceable call frames and alignment, so it might just emit this code as the known solution to that. Variable-length arrays are generally considered a mistake in the language, a tribute to C++, adding a half-working, half-thought-out mechanism to C, so compiler implementors might not give too much attention to the code generated on their behalf.

How do bit-fields work in C?

I just learned about bit-fields in C, and I became curious about how the compiler implements this feature. As far as I know, single bits cannot be accessed individually.
Bit-fields are implemented by reading the surrounding addressable unit of memory (byte or word), masking and shifting.
More precisely, reading a bit-field is implemented as read, shift, mask; writing to a bit-field is implemented as read, mask out the old bits, shift the new value into place, or, and write back.
This is pretty expensive, but if you intend to store data compactly and are willing to pay the price of the bitwise operations, then bit-fields offer a clearer, lighter syntax at the source level for the same operations that you could have written by hand. What you lose is control of the layout (the standard does not specify how bit-fields are allocated from a containing word, and this will vary from compiler to compiler more than the meaning of bitwise operations does).
Whenever you have doubts about what a C compiler does for a given construct, you can always read the assembly code:
struct s {
    unsigned int a:3;
    unsigned int b:3;
} s;

void f(void)
{
    s.b = 5;
}

int g(void)
{
    return s.a;
}
This is compiled by gcc -O -S to:
_f: ## #f
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
movq _s@GOTPCREL(%rip), %rax
movb (%rax), %cl ; read
andb $-57, %cl ; mask
orb $40, %cl ; since the value to write was a constant, 5, the compiler has pre-shifted it by 3, giving 40
movb %cl, (%rax) ; write
popq %rbp
retq
.cfi_endproc
.globl _g
.align 4, 0x90
_g: ## #g
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp7:
.cfi_def_cfa_offset 16
Ltmp8:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp9:
.cfi_def_cfa_register %rbp
movq _s@GOTPCREL(%rip), %rax
movzbl (%rax), %eax
andl $7, %eax
popq %rbp
retq
.cfi_endproc

Setting up local stack according to x86-64 calling convention on linux

I am doing some extended-assembly optimization on GNU C code running on 64-bit Linux. I wanted to print debugging messages from within the assembly code, and that's how I came across the following. I am hoping someone can explain what I am supposed to do in this situation.
Take a look at this sample function:
void test(int a, int b, int c, int d){
    __asm__ volatile (
        "movq $0, %%rax\n\t"
        "pushq %%rax\n\t"
        "popq %%rax\n\t"
        :
        : "m" (a)
        : "cc", "%rax"
    );
}
Since the four arguments to the function are of class INTEGER, they will be passed in registers and then stored on the stack. The strange thing to me is how gcc actually does it:
test:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movl %ecx, -16(%rbp)
movq $0, %rax
pushq %rax
popq %rax
popq %rbp
ret
The passed arguments are stored on the stack, but the stack pointer is not decremented. Thus, when I do pushq %rax, the values of a and b are overwritten.
What I am wondering: is there a way to ask gcc to properly set up the local stack? Am I simply not supposed to use push and pop in function calls?
The x86-64 ABI provides a 128-byte red zone below the stack pointer, and the compiler decided to use it: a leaf function may keep data there without adjusting %rsp. You can turn this off with the -mno-red-zone option.

Why does gcc create redundant assembly code?

I wanted to look into how certain C/C++ features were translated into assembly and I created the following file:
struct foo {
    int x;
    char y[0];
};

char *bar(struct foo *f)
{
    return f->y;
}
I then compiled this with gcc -S (and also tried with g++ -S) but when I looked at the assembly code, I was disappointed to find a trivial redundancy in the bar function that I thought gcc should be able to optimize away:
_bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movabsq $4, %rcx
addq %rcx, %rax
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
popq %rbp
ret
Leh_func_end1:
Among other things, the lines
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
seem pointlessly redundant. Is there any reason gcc (and possibly other compilers) cannot/does not optimize this away?
You say you thought gcc should be able to optimize this away.
From the gcc manual:
Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results.
In other words, it doesn't optimize unless you ask it to. When I turn on optimizations using the -O3 flag, gcc 4.4.6 produces much more efficient code:
bar:
.LFB0:
.cfi_startproc
leaq 4(%rdi), %rax
ret
.cfi_endproc
For more details, see Options That Control Optimization in the manual.
The code the compiler generates without optimization is typically a straight instruction-by-instruction translation, and the instructions are not those of the program but those of an intermediate representation in which redundancy may have been introduced.
If you expect assembly without such redundant instructions, use gcc -O -S
The kind of optimization you were expecting is called peephole optimization. Compilers usually have plenty of these, because unlike more global optimizations they are cheap to apply and generally do not risk making the code worse, at least when applied towards the end of compilation.
In this blog post, I provide an example where both GCC and Clang may go as far as generating shorter 32-bit instructions when the integer type in the source code is 64-bit but only the lowest 32-bit of the result matter.

Not getting Segmentation Fault in C

here is the C code:
char **s;
s[334] = strdup("test");
printf("%s\n", s[334]);
I know that strdup allocates the copy of "test", but s[334], the slot where we put the pointer to that string, is not allocated. However, this code works like a charm.
Your code exhibits undefined behavior. That does not mean it will crash. All it means is that you can't predict anything about what will happen.
A crash is rather likely, but not guaranteed at all, in this case.
"Undefined behaviour" doesn't mean you'll get a segfault, it means you might get a segfault. A conforming implementation might also decide to display ASCII art of a puppy.
You might like to check this code with a tool like Valgrind.
You don't always get segmentation fault if you access uninitialized memory.
You do access uninitialized memory here.
The compiler is too smart for us! It knows that printf("%s\n", some_string) is exactly the same as puts(some_string), so it can simplify
char **s;
s[334]=strdup("test");
printf("%s\n",s[334]);
into
char **s;
s[334]=strdup("test");
puts(s[334]);
and then (assuming no UB) that is again equivalent to
puts(strdup("test"));
So, by chance, the segmentation fault didn't happen (this time).
I get a segfault without optimisations, but when compiled with optimisations, gcc doesn't bother with s at all; the store is eliminated as dead code.
gcc -Os -S:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi # .LC0 is where "test" is at
call strdup
addq $8, %rsp
.cfi_def_cfa_offset 8
movq %rax, %rdi
jmp puts
.cfi_endproc
gcc -S -O (same for -O2, -O3):
.LFB23:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $5, %edi
call malloc
movq %rax, %rdi
testq %rax, %rax
je .L2
movl $1953719668, (%rax)
movb $0, 4(%rax)
.L2:
call puts
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
