How do I transfer control after a loop in assembly?

I would like to translate the C code below into assembly language.
However, I do not see that I need to use the stack in this example.
Moreover, I'd like to know whether or not "beq" saves the address of the following instruction in $ra like "jal" does, for when the loop ends, I would like to get back to the original function foo, and continue the instructions (which here is simply returning.)
int foo(int* a, int N) {
if(N > 0)
{
for(int i = 0; i != N; i = i + 1)
{
a[i] = bar(i << 4, a[i]);
}
}
return N & 7;
}
# assume a in $a0, N in $a1
foo:
slt $t0, $zero, $a1 #put 1 in $t0 if 0 < N
li $t1,0 # use $t1 as loop counter
beq $t0, 1, loop # enter loop if 0 < N
and $v0, $a1, 7 # do bitwise and on N and 7 and save in $v0 as return value
loop:
beq $t1, $a1, exit # exit loop when i = N
sll $t3, $t1, 2 # obtain 4 * i
add $t3, $a1, $t3 # obtain address of a[i] which is address of a plus 4i
lw $t3, 0($t3) # load a[i] into $t3
sll $t4, $t1, 4 #perform i<< 4 and save in $t4
# the 2 previous load arguments for bar
jal bar # assume bar saves return value in $v2
sw $t3, 0($v1)
j loop
exit:
and $v0, $a1, 7

beq is for conditional branching, not calling — it changes the PC (conditionally) but not $ra. We use it to translate structured statements (e.g. if, for) into the if-goto style of assembly language.
However, I do not see that I need to use the stack in this example.
You must use the stack for this code because the call to bar (as in jal bar) will wipe out foo's $ra, and while bar will be able to return back to foo, foo will not be able to return to its caller. Since this requires a stack, you will need a prologue and an epilogue to allocate and release some stack space.
Your code is not properly passing parameters to bar, i << 4, for example, should be passed in $a0, while a[i] should be passed in $a1.
You do not have a return instruction in foo — it is missing a jr $ra.
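To make the if-goto idea concrete, here is a rough C sketch of foo with the structured statements replaced by labels and gotos, so each line maps onto one branch, jump, or call. The body of bar is a made-up stand-in (the question never shows it), and the MIPS mnemonics in the comments are only one plausible mapping.

```c
#include <assert.h>

/* Hypothetical stand-in for bar(); the question does not show its body. */
static int bar(int x, int y) { return x + y; }

/* foo in if-goto form: every label/goto corresponds to one assembly
   label and one branch or jump. */
int foo(int *a, int N) {
    int i;
    assert(N <= 0 || a != 0);   /* sketch-only sanity check */
    if (N <= 0) goto done;      /* blez $a1, done : skip the loop */
    i = 0;
loop:
    if (i == N) goto done;      /* beq : conditional branch, $ra untouched */
    a[i] = bar(i << 4, a[i]);   /* jal bar : the only point where foo's $ra is overwritten */
    i = i + 1;
    goto loop;                  /* j loop */
done:
    return N & 7;               /* andi $v0, $a1, 7 ; jr $ra */
}
```

Because the call sits inside the loop, a, N, i, and $ra all have to survive it, which is why they belong in saved registers or on the stack.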

If either of your beq instructions did set $ra, those wouldn't be useful points to return back to. But since you asked:
I'd like to know whether or not "beq" saves the address of the following instruction in $ra like "jal" does
If the instruction mnemonic doesn't end with al (which stands for And Link), it doesn't save a return address in $ra.
Classic MIPS has the following instructions that link, from this somewhat incomplete reference (missing nor and IDK what else).
jal target (Jump And Link)
BGEZAL $reg, target (conditional Branch if >= 0 And Link)
BLTZAL $reg, target (conditional Branch if < 0 And Link)
Note that the conditional branches are effectively branching on the sign bit of the register.
bal is an alias for bgezal $zero, target, useful for doing a PC-relative function call. (MIPS branches use a fully relative encoding for branch displacement, MIPS jumps use a region-absolute encoding that replaces the low 28 bits of PC+4. This matters for position-independent code).
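As a rough illustration of the two encodings, this C sketch computes where a branch and a jump would land from the instruction's address and its immediate field. The function names are invented for this example; the arithmetic follows the classic MIPS32 definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Branch (beq, bne, bal, ...): 16-bit signed word offset added to PC+4. */
uint32_t branch_target(uint32_t pc, uint16_t imm16) {
    /* sign-extend the 16-bit field, then scale from words to bytes */
    int32_t offset = ((int32_t)(imm16 ^ 0x8000u) - 0x8000) * 4;
    return pc + 4 + (uint32_t)offset;
}

/* Jump (j, jal): 26-bit word index replaces the low 28 bits of PC+4. */
uint32_t jump_target(uint32_t pc, uint32_t index26) {
    assert(index26 <= 0x03FFFFFFu);   /* the field is only 26 bits wide */
    return ((pc + 4) & 0xF0000000u) | (index26 << 2);
}
```

Moving the code to a different address changes what branch_target returns by the same amount, while jump_target keeps pointing at the same place within the 256 MiB region, which is why only the branch form is position-independent.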
None of this is particularly relevant to your case; your foo needs to save/restore $ra on entry/before jr $ra because you need to call bar with a jal or bal. Using a linking branch as the loop branch wouldn't affect anything (except to make your code even less efficient, and make performance worse on real CPUs that do return-address prediction with a special predictor that assumes jal and jr $ra are paired properly).
Using bal / jal doesn't automatically make the thing you jump to ever return; that only happens if the target ever uses jr $ra (potentially after copying $ra somewhere else then restoring it).

Related

Quicksort from C to MIPS - How to pass parameters and maintaining variables for stack frame?

I am creating a Quicksort algorithm on an array of integers. I am using this C algorithm and translating it into MIPS. However, MIPS and recursion is very tough indeed.
I am unsure how to send parameters into the recursive call QS. I recently discovered that I can change my $s registers for each frame in the call stack, by moving the stack pointer 4 bytes. This will allow me to change the $s registers for each stack frame such that I don't need a million variables for each QS frame.
My problem is that I don't really understand how and when to set and get these $sx values during recursion.

Recursion is implemented by moving the stack pointer register ($sp).
First of all, let's understand the point of moving the stack pointer:
When you use recursion in a high level language, what it does, basically, is to "save" the state of the current function call in the "stack memory".
To achieve this, you will have to:
Save the current state of your program (all the variables/registers
you are using within the scope of the "function"), in the stack memory;
Call the function "recursively" (which might modify all the registers you were using);
When the function finishes, you have to restore the previous state and "free" the space you allocated.
But besides that, we have to save the value of $ra, to keep track of where we're supposed to go when the upper function ends.
Here's a simple example of a program that calculates factorial(n) recursively:
.text
main:
# Calls Fact with Input ($a0) N = 10
li $a0, 10
jal fact
# prints the Output ($v0) Factorial(N)
move $a0, $v0
li $v0, 1
syscall
# exit
li $v0, 10
syscall
# Input: $a0 - N
# Output: $v0 - Factorial(N)
fact:
# Fact(0) = 1
beq $a0, 0, r_one
# Fact(N) = N * Fact(N-1) use recursion
# allocate 8 bytes in the stack for storing N, and $ra
addi $sp, $sp, -8
# stores N in the first, and $ra in the last position
sw $a0, 4($sp)
sw $ra, 0($sp)
# call Fact(N-1)
addi $a0, $a0, -1
jal fact
# Restore the values of N and $ra
lw $a0, 4($sp)
lw $ra, 0($sp)
# Free the 8 bytes used
addi $sp, $sp, 8
# Set the return value to be N * Fact(N-1) and return
mul $v0, $a0, $v0
jr $ra
# return 1;
r_one:
li $v0, 1
jr $ra
This is what you should keep in mind when implementing your code, basically.
Just pay attention to:
The stack pointer is decremented;
How many bytes you need to allocate. In this example I use two 32-bit integers, 8 bytes in total. It will depend on how many variables you need to store, and their size.
How to access them with lw and sw, using the correct index. Also, be aware of memory alignment;
This does not only apply to recursion. Any function that calls another function with jal has to save $ra the same way if it wants to return to its own caller, along with any caller-saved registers it still needs after the call. You can also use the stack memory to store an array, a struct, etc.
Edit:
Some considerations:
The right place to do that is where your code calls the function (allocate and save), and after this call (restore and free).
Understand your code to know which variables need to be saved (might be used).
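One way to see what the stack is doing in fact is to model it by hand in C. The sketch below is a simplification under stated assumptions: the array stack, the index sp, and the size STACK_WORDS are invented for illustration, one slot is used per frame instead of 8 bytes, and $ra is left out because C tracks the return address itself.

```c
#include <assert.h>

enum { STACK_WORDS = 64 };        /* hypothetical size, enough for this sketch */

/* Mirrors fact: push N on the way down, pop and multiply on the way up. */
unsigned fact_explicit_stack(unsigned n) {
    unsigned stack[STACK_WORDS];
    int sp = STACK_WORDS;         /* grows downward, like $sp */
    unsigned result;

    assert(n < STACK_WORDS);      /* one slot per pending call */
    while (n != 0) {              /* beq $a0, 0, r_one */
        sp -= 1;                  /* addi $sp, $sp, -8 (one slot here) */
        stack[sp] = n;            /* sw $a0, 4($sp) */
        n -= 1;                   /* addi $a0, $a0, -1 ; jal fact */
    }
    result = 1;                   /* r_one: li $v0, 1 */
    while (sp != STACK_WORDS) {   /* one pass per pending return */
        n = stack[sp];            /* lw $a0, 4($sp) */
        sp += 1;                  /* addi $sp, $sp, 8 */
        result = n * result;      /* mul $v0, $a0, $v0 */
    }
    return result;
}
```

The first loop corresponds to the chain of jal fact calls and the second to the chain of jr $ra returns; every value pushed on the way down is popped exactly once on the way up.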

C ternary conditional operator to MIPS assembly with one side loaded from memory

C statement
A= A? C[0] : B;
Is it correct to write the assembly instructions this way?
Assuming $t1=A, $t2=B, $s1=base address of Array C:
beq $t1, $0, ELSE
lw $t1, 0($s1)
ELSE: add $t1, $t2, $0

No, it doesn't seem correct because add $t1, $t2, $0 will be executed even if $t1 != $0.
I hope that this works (not tested):
beq $t1, $0, ELSE
sll $0, $0, 0 # NOP : avoid the instruction after branch being executed
lw $t1, 0($s1)
j END
sll $0, $0, 0 # NOP : avoid the instruction after branch being executed
ELSE: add $t1, $t2, $0
END:
This code assumes that each element of C is 4 bytes long.

You can avoid an unconditional j. Instead of structuring this as an if/else, always do the A=B (because copying a register is cheaper than jumping) and then optionally do the load.
On a MIPS with branch-delay slots, the delay slot actually helps us:
# $t1=A, $t2=B, $s1=base
beq $t1, $zero, noload
move $t1, $t2 # branch delay: runs always
lw $t1, 0($s1)
noload:
# A= A? C[0] : B;
On a MIPS without branch-delay slots (like MARS or SPIM in their default config):
# MIPS without branch-delay slots
# $t1=A, $t2=B, $s1=base
move $t3, $t1 # tmp=A
move $t1, $t2 # A=B
beq $t3, $zero, noload # test the original A
lw $t1, 0($s1)
noload:
# $t1 = A= A ? C[0] : B;
If we can clobber B and rearrange registers, we can save an insn without a branch-delay:
# MIPS without branch-delay slots
# $t1=A, $t2=B, $s1=base
beq $t1, $zero, noload
lw $t2, 0($s1)
noload:
# $t2 = A. B is "dead": we no longer have it in a register
A move $t3, $t2 before the BEQ could save B in another register before this sequence. Ending up with your variables in different registers can save instructions, but makes it harder to keep track of things. In a loop, you can get away with this if you're unrolling the loop because the 2nd copy of the code can re-shuffle to get registers back the way they need to be for the top of the loop.
move x, y is a pseudo-instruction for or x, y, $zero or ori x, y, 0. Or addu x, y, $zero. Implement it however you like or let your assembler do it for you.
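The copy-first restructuring can be checked at the C level before writing any asm. This is a small sketch, with function names made up for the example, that expresses the no-delay-slot version in if-goto form next to the plain ternary.

```c
#include <assert.h>   /* for checks like the one in the note below */

/* Direct form: A = A ? C[0] : B */
int select_ternary(int A, int B, const int *C) {
    return A ? C[0] : B;
}

/* Copy-first form: always do A = B, then optionally do the load. */
int select_copy_then_load(int A, int B, const int *C) {
    int tmp = A;        /* move $t3, $t1 : keep the original A for the test */
    A = B;              /* move $t1, $t2 */
    if (tmp == 0)       /* beq $t3, $zero, noload */
        goto noload;
    A = C[0];           /* lw $t1, 0($s1) */
noload:
    return A;
}
```

A check such as assert(select_ternary(a, b, C) == select_copy_then_load(a, b, C)) over zero, positive, and negative values of a is enough to gain some confidence that no unconditional jump is needed.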

Write a MIPS segment for the C statement
x=5; y=7;
Land(x,y,z) // z=x &y as procedure call
if (z>x) y=x+1

C Programming to MIPS Assembly (for Loops)

I'm trying to convert this C code to MIPS assembly and I am unsure if it is correct. Can someone help me, please?
Question : Assume that the values of a, b, i, and j are in registers $s0, $s1, $t0, and $t1, respectively. Also, assume that register $s2 holds the base address of the array D
C Code :
for(i=0; i<a; i++)
for(j=0; j<b; j++)
D[4*j] = i + j;
My Attempt at MIPS ASSEMBLY
add $t0, $t0, $zero # i = 0
add $t1, $t1, $zero # j = 0
L1 : slt $t2, $t0, $s0 # i<a
beq $t2, $zero, EXIT # if $t2 == 0, Exit
add $t1, $zero, $zero # j=0
addi $t0, $t0, 1 # i ++
L2 : slt $t3, $t1, $s1 # j<b
beq $t3, $zero, L1, # if $t3 == 0, goto L1
add $t4, $t0, $t1 # $t4 = i+j
muli $t5, $t1, 4 # $t5 = $t1 * 4
sll $t5, $t5, 2 # $t5 << 2
add $t5, $t5, $s2 # D + $t5
sw $t4, $t5($s2) # store word $t4 in addr $t5(D)
addi $t0, $t1, 1 # j ++
j L2 # goto L2
EXIT :

add $t0, $t0, $zero # i = 0
Nope, that leaves $t0 unmodified, holding whatever garbage it did before. Perhaps you meant to use addi $t0, $zero, 0?
Also, MIPS doesn't have 2-register addressing modes (for integer load/store), only 16-bit-constant ($reg). $t5($s2) isn't legal. You need a separate addu instruction, or better a pointer-increment.
(You should use addu instead of add for pointer math; it's not an error if address calculation crosses from the low half to high half of address space.)
In C, it's undefined behaviour for another thread to be reading an object while you're writing it, so we can optimize away the actual looping of the outer loop. Unless the type of D is _Atomic int *D or volatile int *D, but that isn't specified in the question.
The inner loop writes the same elements every time regardless of the outer loop counter, so we can optimize away the outer loop and only do the final outer iteration, with i = a-1. Unless a <= 0, then we must skip the outer loop body, i.e. do nothing.
Optimizing away all but the last store to every location is called "dead store elimination". The stores in earlier outer-loop iterations are "dead" because they're overwritten with nothing reading their value.
You normally want to put the loop condition at the bottom of the loop, so the loop branch is a bne $t0, $t1, top_of_loop for example. (MIPS has bne as a native hardware instruction; blt is only a pseudo-instruction unless the 2nd register is $zero.) So we want to optimize j<b to j!=b because we know we're counting upward.
Put a conditional branch before the loop to check if it might need to run zero times, e.g. blez $s1, after_loop to skip the inner loop body if b <= 0 (b is in $s1).
An idiomatic for(i=0 ; i<a ; i++) loop in asm looks like this in C (or some variation on this).
if(a<=0) goto end_of_loop;
int i=0;
do{ ... }while(++i != a);
Or if i isn't used inside the loop, then i=a and do{}while(--i). (i.e. add -1 and use bnez). Although MIPS can branch just as efficiently on i!=a as it can on i!=0, unlike most architectures with a FLAGS register where counting down saves a compare instruction.
D[4*j] means we stride by 16 bytes in a word array. Separately using a multiply by 4 and a shift by 2 is crazy redundant. Just keep a pointer in a separate register and increment it by 16 every iteration, like a C compiler would.
We don't know the type of D, or any of the other variables for that matter. If any of them are narrow unsigned integers, we might need to implement 8 or 16-bit truncation/wrapping.
But your implementation assumes they're all int or unsigned, so let's do that.
I'm assuming a MIPS without branch-delay slots, like MARS simulates by default.
i+j starts out (with j=0) as a-1 on the last outer-loop iteration that sets the final value. It runs up to j=b-1, so the max value is a-1 + b-1.
Simplifying the problem down to the values we need to store, and the locations we need to store them in, before writing any asm, means the asm we do write is a lot simpler and easier to debug.
You could check the validity of most of these transformations by doing them in C source and checking with a unit test in C.
# int a: $s0
# int b: $s1
# int *D: $s2
# Pointer to D[4*j] : $t0
# int i+j : $t1
# int a-1 + b : $t2 loop bound
blez $s0, EXIT # if(a<=0) goto EXIT
blez $s1, EXIT # if(b<=0) goto EXIT
# now we know both a and b loops run at least once so there's work to do
addiu $t1, $s0, -1 # tmp = a-1 // addu because the C source doesn't do this operation, so we must not fault on signed overflow here. Although that's impossible because we already excluded negatives
addu $t2, $t1, $s1 # tmp_end = a-1 + b // one past the max we store
addu $t0, $s2, $zero # p = D // copy so the D pointer in $s2 survives; otherwise increment $s2 directly
inner: # do {
sw $t1, ($t0) # *p = tmp // store i+j
addiu $t1, $t1, 1 # tmp++;
addiu $t0, $t0, 16 # 4*sizeof(*D) # could go in the branch-delay slot
bne $t1, $t2, inner # }while(tmp != tmp_end)
EXIT:
We could have done the increment first, before the store, and used a-2 and a+b-2 as the initializer for tmp and tmp_end. On some real pipelined/superscalar MIPS CPUs, that might be better to avoid putting the increment right before the bne that reads it. (After moving the pointer-increment into the branch-delay slot). Of course you'd actually unroll to save work, e.g. using sw $t1, 16($t0) and 32($t0) / 48($t0).
Again on a real MIPS with branch delays, you'd move some of the init of $t0..2 to fill the branch delay slots from the early-out blez instructions, because they couldn't be adjacent.
So as you can see, your version was over-complicated to say the least. Nothing in the question said we have to transliterate each C expression to asm separately, and the whole point of C is the "as-if" rule that allows optimizations like this.
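The suggestion to validate these transformations in C can be sketched roughly as follows. The array bound MAX_B and all function names are assumptions made for this example, and D is taken to be int *, as the answer assumes.

```c
#include <assert.h>
#include <string.h>

enum { MAX_B = 8, D_LEN = 4 * MAX_B };   /* hypothetical bounds for the test */

/* Literal translation of the question's nested loops. */
void fill_naive(int *D, int a, int b) {
    for (int i = 0; i < a; i++)
        for (int j = 0; j < b; j++)
            D[4 * j] = i + j;
}

/* Optimized form: only the last outer iteration, with a pointer increment. */
void fill_optimized(int *D, int a, int b) {
    if (a <= 0 || b <= 0) return;      /* the two blez early-outs */
    int tmp = a - 1;                   /* i + j with i = a-1, j = 0 */
    int tmp_end = tmp + b;             /* one past the last value stored */
    int *p = D;
    do {
        *p = tmp;                      /* sw $t1, ($t0) */
        tmp++;                         /* addiu $t1, $t1, 1 */
        p += 4;                        /* addiu $t0, $t0, 16 for 4-byte int */
    } while (tmp != tmp_end);          /* bne $t1, $t2, inner */
}

/* Returns 1 if both versions leave identical array contents. */
int versions_agree(int a, int b) {
    int x[D_LEN], y[D_LEN];
    assert(b <= MAX_B);                /* keep 4*(b-1) inside the arrays */
    memset(x, 0xAA, sizeof x);         /* same sentinel pattern in both */
    memset(y, 0xAA, sizeof y);
    fill_naive(x, a, b);
    fill_optimized(y, a, b);
    return memcmp(x, y, sizeof x) == 0;
}
```

Calling versions_agree over a small grid of a and b values, including zero and negative ones, exercises both early-out paths as well as the main loop.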

This similar C code compiles and translates to MIPS:
#include <stdio.h>
main()
{
int a,b,i,j=5;
int D[50];
for(i=0; i<a; i++)
for(j=0; j<b; j++)
D[4*j] = i + j;
}
Result:
.file 1 "Ccode.c"
# -G value = 8, Cpu = 3000, ISA = 1
# GNU C version cygnus-2.7.2-970404 (mips-mips-ecoff) compiled by GNU C version cygnus-2.7.2-970404.
# options passed: -msoft-float
# options enabled: -fpeephole -ffunction-cse -fkeep-static-consts
# -fpcc-struct-return -fcommon -fverbose-asm -fgnu-linker -msoft-float
# -meb -mcpu=3000
gcc2_compiled.:
__gnu_compiled_c:
.text
.align 2
.globl main
.ent main
main:
.frame $fp,240,$31 # vars= 216, regs= 2/0, args= 16, extra= 0
.mask 0xc0000000,-4
.fmask 0x00000000,0
subu $sp,$sp,240
sw $31,236($sp)
sw $fp,232($sp)
move $fp,$sp
jal __main
li $2,5 # 0x00000005
sw $2,28($fp)
sw $0,24($fp)
$L2:
lw $2,24($fp)
lw $3,16($fp)
slt $2,$2,$3
bne $2,$0,$L5
j $L3
$L5:
.set noreorder
nop
.set reorder
sw $0,28($fp)
$L6:
lw $2,28($fp)
lw $3,20($fp)
slt $2,$2,$3
bne $2,$0,$L9
j $L4
$L9:
lw $2,28($fp)
move $3,$2
sll $2,$3,4
addu $4,$fp,16
addu $3,$2,$4
addu $2,$3,16
lw $3,24($fp)
lw $4,28($fp)
addu $3,$3,$4
sw $3,0($2)
$L8:
lw $2,28($fp)
addu $3,$2,1
sw $3,28($fp)
j $L6
$L7:
$L4:
lw $2,24($fp)
addu $3,$2,1
sw $3,24($fp)
j $L2
$L3:
$L1:
move $sp,$fp # sp not trusted here
lw $31,236($sp)
lw $fp,232($sp)
addu $sp,$sp,240
j $31
.end main

Converting C code to MIPS (arrays)

for (i = 0; i < 64; i++) {
A[i] = B[i] + C[i];
}
The MIPS instructions for the above C code are:
add $t4, $zero, $zero # I1 i is initialized to 0, $t4 = 0
Loop:
add $t5, $t4, $t1 # I2 temp reg $t5 = address of b[i]
lw $t6, 0($t5) # I3 temp reg $t6 = b[i]
add $t5, $t4, $t2 # I4 temp reg $t5 = address of c[i]
lw $t7, 0($t5) # I5 temp reg $t7 = c[i]
add $t6, $t6, $t7 # I6 temp reg $t6 = b[i] + c[i]
add $t5, $t4, $t0 # I7 temp reg $t5 = address of a[i]
sw $t6, 0($t5) # I8 a[i] = b[i] + c[i]
addi $t4, $t4, 4 # I9 i = i + 1
slti $t5, $t4, 256 # I10 $t5 = 1 if $t4 < 256, i.e. i < 64
bne $t5, $zero, Loop # I11 go to Loop if $t4 < 256
For I8, could the sw instruction not be replaced with an addi instruction? i.e addi $t5, $t6, 0
Wouldn't it achieve the same task of copying the address of $t6 into $t5? I would like to know the difference and when to use either of them. Same could be said about the lw instruction.
Also, maybe a related question, how does MIPS handle pointers?
edit: changed addi $t6, $t5, 0.

The sw instruction in MIPS stores the first argument (value in $t6) to the address in the second argument (value in $t5) offset by the constant value (0).
You're not actually trying to store the $t5 address into a register, but rather storing the value in $t6 into the memory location represented by the value of $t5.
If you like, you could consider the value in $t5 to be analogous to a C pointer. In other words, MIPS does not handle pointers vs values differently-- all that matters is where you use the values. If you use a register's value as the second argument to lw or sw, then you are effectively using that register as a pointer. If you use a register's value as the first argument to lw or sw, or in most other places, you are operating directly on the value. (Of course, just like in C pointer arithmetic, you might manipulate an address so you can store a piece of data somewhere else in memory.)

For I8, could the sw instruction not be replaced with an addi instruction? i.e addi $t6, $t5, 0
No. The sw instruction stores a register's value to memory. The add just manipulates registers. And lw gets a word from memory. Loads (lw and its narrower variants such as lb and lh) are the only MIPS instructions that read memory. (Other processors might and do have versions of add that access memory, but not MIPS.)
It's necessary to adjust your thinking when working in assembly language. Registers and memory are separate. In higher level languages, registers are (nearly) completely hidden. In assembly, registers are a separate resource that the programmer must manage. And they're a scarce resource. A HLL compiler would do this for you, but by programming in assembly, you have taken the job for yourself.
how does MIPS handle pointers?
In MIPS, pointers are just integers (in registers or memory) that happen to be memory addresses. The only way they're distinguished from data values is by your brain. The "pointer" is something invented by higher level language designers to relieve you the programmer of this burden. If you look closely, you'll see that $t5 actually holds a pointer. It's a memory address used by lw and sw as the address to load from or store to.
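The "pointers are just integers" view can be imitated in C with uintptr_t. The sketch below is an illustration only: the helper names are invented, and it assumes a flat address space where integer arithmetic on a converted pointer behaves like address arithmetic, which mainstream platforms provide but ISO C does not strictly promise.

```c
#include <assert.h>
#include <stdint.h>

/* Load the way lw does: integer base + byte offset, then dereference. */
int32_t load_word(uintptr_t base, uintptr_t byte_offset) {
    uintptr_t addr = base + byte_offset;      /* add $t5, $t4, $t1 */
    assert(addr % sizeof(int32_t) == 0);      /* lw needs a word-aligned address */
    return *(const int32_t *)addr;            /* lw $t6, 0($t5) */
}

/* Store the way sw does: the value goes to memory, not to a register. */
void store_word(uintptr_t base, uintptr_t byte_offset, int32_t value) {
    uintptr_t addr = base + byte_offset;      /* add $t5, $t4, $t0 */
    *(int32_t *)addr = value;                 /* sw $t6, 0($t5) */
}

/* A[i] = B[i] + C[i] for 64 elements, stepping a byte offset by 4. */
void add_arrays(int32_t *A, const int32_t *B, const int32_t *C) {
    for (uintptr_t off = 0; off < 256; off += 4)   /* slti $t5, $t4, 256 */
        store_word((uintptr_t)A, off,
                   load_word((uintptr_t)B, off) + load_word((uintptr_t)C, off));
}
```

Replacing store_word with a plain assignment to a local variable would be the analogue of swapping sw for addi: the sum would be computed, but nothing would ever reach the array A.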

For I8, could the sw instruction not be replaced with an add instruction? Wouldn't it achieve the same task of copying the address of $t5 into $t0? I would like to know the difference and when to use either of them.
I think you are confused about what a store word actually does. In I8, the value in register $t6 is being written to memory at the address held in $t5, plus an offset of zero. An add will overwrite whatever data is stored in the destination register with the sum of the two other registers' values.
Also, maybe a related question, how does MIPS handle pointers?
The "pointers" are just addresses in memory stored in the registers (as opposed to values).

lw and sw read/write to memory. addi and other arithmetic operations operate on registers.
Registers are like little buckets the CPU uses to store data. MIPS has 32 general-purpose registers, so a register number fits in 5 bits of an instruction.
Memory is like a vast ocean of data that requires well over 16 bits to address. So you actually have to store the address in a register.
Pointers are simply memory addresses (32 bit on a 32 bit architecture).

Is this MIPS strlen correctly converted from the corresponding C loop?

I have a simple question for a Comp Sci class I'm taking where my task is to convert a function into MIPS assembly language. I believe I have a correct answer but I want to verify it.
This is the C function
int strlen(char *s) {
int len;
len=0;
while(*s != '\0') {
len++;
s++;
}
return len;
}
Thanks!
strlen:
add $v0, $zero, $zero # len = 0
loop: # do{
lbu $t0, 0($a0) # tmp0 = load *s
addi $a0, $a0, 1 # s++
addi $v0, $v0, 1 # len++
bne $t0, $zero, loop # }while(tmp0 != 0)
s_end:
addi $v0, $v0, -1 # undo counting of the terminating 0
j $ra

Yeah, you have a correct asm version, and I like the fact that you do as much work as possible before testing the value of t0 to give as much time as possible for loading from memory.

(Editor's note: the add of -1 after the loop corrects for off by 1 while still allowing an efficient do{}while loop structure. This answer proposes a more literal translation from C into an if() break inside an unconditional loop.)
I think the while loop isn't right in the case of *s == 0.
It should be something like this:
...
lbu $t0, 0($a0)
loop:
beq $t0, $zero, s_end # *
...
b loop
s_end:
...
*You could use a macro instruction (beqz $t0, s_end) instead of beq instruction.

Yes, looks correct to me, and fairly efficient. Implementing a while loop with asm structured like a do{}while() is the standard and best way to loop in asm. Why are loops always compiled into "do...while" style (tail jump)?
A more direct transliteration of the C would check *s before incrementing len.
e.g. by peeling the first iteration and turning it into a load/branch that can skip the whole loop for an empty string. (And reordering the loop body, which would probably put the load close to the branch, worse for performance because of load latency.)
You could optimize away the len-- overshoot-correction after the loop: start with len=-1 instead of 0. Use li $v0, -1 which can still be implemented with a single instruction:
addiu $v0, $zero, -1
A further step of optimization is to only do the pointer increment inside the loop, and find the length at the end with len = end - start.
We can correct for the off-by-one (to not count the terminator) by offsetting the incoming pointer while we're copying it to another reg.
# char *s input in $a0, size_t length returned in $v0
strlen:
addiu $v0, $a0, 1 # char *start_1 = start + 1
loop: # do{
lbu $t0, ($a0) # char tmp0 = load *s
addiu $a0, $a0, 1 # s++
bne $t0, $zero, loop # }while(tmp0 != '\0')
s_end:
subu $v0, $a0, $v0 # size_t len = s - start
jr $ra
I used addiu / subu because I don't want it to fault on signed-overflow of a pointer. Your version should probably use addiu as well so it works for strings up to 4GB, not just 2.
Untested, but we can think through the correctness:
For an empty string input (s points at a 0): when we reach the final subtract, we have v0=s+1 (from before the loop) and a0=s+1 (from the first/only iteration which falls through because it loads $t0 = 0). Subtracting these gives len=0 = strlen("")
For a length=1 string: v0=s+1, but the loop body runs twice so we have a0=s+2. len = (s+2) - (s+1) = 1.
By induction, larger lengths work too.
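Since the asm above is untested, one cheap cross-check is to write the same pointer-difference logic in C and test that instead. This sketch mirrors the asm line by line; the function name is made up to avoid clashing with the library strlen.

```c
#include <assert.h>   /* for the checks mentioned in the note below */
#include <stddef.h>

/* Length via pointer difference, with the off-by-one folded into the start. */
size_t strlen_ptrdiff(const char *s) {
    const char *start_1 = s + 1;    /* addiu $v0, $a0, 1 */
    unsigned char tmp;
    do {
        tmp = (unsigned char)*s;    /* lbu $t0, ($a0) */
        s++;                        /* addiu $a0, $a0, 1 */
    } while (tmp != 0);             /* bne $t0, $zero, loop */
    return (size_t)(s - start_1);   /* subu $v0, $a0, $v0 */
}
```

Asserting that strlen_ptrdiff("") is 0 and strlen_ptrdiff("a") is 1 covers the two cases reasoned through above.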
For MIPS with a branch-delay slot, the addiu and subu can be reordered after bne and jr respectively, filling those branch-delay slots. (But then bne is right after the load so classic MIPS would have to stall, or even fill the load-delay slot with a nop on a MIPS I without interlocks for loads).
Of course if you actually care about real-world strlen performance for small to medium strings (not just tiny), like more than 8 or 16 bytes, use a bithack that checks whole words at once for maybe having a 0 byte.
Why does glibc's strlen need to be so complicated to run quickly?
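For reference, a common form of that zero-byte bithack looks roughly like this in C. It is a sketch with invented function names, not glibc's implementation: it scans a buffer of known length so that every read stays in bounds, whereas a real strlen has to rely on aligned word reads that may run past the terminator, which is outside what ISO C guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if at least one of the four bytes of v is zero. */
static uint32_t has_zero_byte(uint32_t v) {
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Index of the first zero byte in buf[0..n), or n if there is none. */
size_t find_zero_byte(const unsigned char *buf, size_t n) {
    size_t i = 0;
    while (i + 4 <= n) {               /* whole words first */
        uint32_t w;
        memcpy(&w, buf + i, 4);        /* safe unaligned 4-byte load */
        if (has_zero_byte(w)) break;   /* some byte in this word is 0 */
        i += 4;
    }
    while (i < n && buf[i] != 0)       /* pin down the exact byte */
        i++;
    assert(i <= n);
    return i;
}
```

The word test only says that a zero byte exists somewhere in the word, so the byte loop at the end is still needed to find its exact position.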