RISC-V assembly - stack layout - function call


Currently I am working with a RISC-V processor implementation. I need to run partially hand-crafted assembly code. (Finally there will be dynamic code injection.) For this purpose I have to understand the basics of function calls within RISC-V assembly.
I found this topic very helpful: confusion about function call stack
But I am still struggling with the stack layout for a function call. Please consider the following C code:
void some_func(int a, int b, int* c){
int cnt = a;
for(;cnt > 0;cnt--){
*c += b;
}
}
void main(){
int a = 5;
int b = 6;
int c = 0;
some_func(a,b,&c);
}
This program implements a basic multiplication by a sequence of additions. The derived assembly code (riscv64-unknown-elf-gcc -nostartfiles mul.c -o mul && riscv64-unknown-elf-objdump -D mul) looks like this:
0000000000010000 <some_func>:
10000: fd010113 addi sp,sp,-48
10004: 02813423 sd s0,40(sp)
10008: 03010413 addi s0,sp,48
1000c: fca42e23 sw a0,-36(s0)
10010: fcb42c23 sw a1,-40(s0)
10014: fcc43823 sd a2,-48(s0)
10018: fdc42783 lw a5,-36(s0)
1001c: fef42623 sw a5,-20(s0)
10020: 0280006f j 10048 <some_func+0x48>
10024: fd043783 ld a5,-48(s0)
10028: 0007a703 lw a4,0(a5)
1002c: fd842783 lw a5,-40(s0)
10030: 00f7073b addw a4,a4,a5
10034: fd043783 ld a5,-48(s0)
10038: 00e7a023 sw a4,0(a5)
1003c: fec42783 lw a5,-20(s0)
10040: fff7879b addiw a5,a5,-1
10044: fef42623 sw a5,-20(s0)
10048: fec42783 lw a5,-20(s0)
1004c: fcf04ce3 bgtz a5,10024 <some_func+0x24>
10050: 00000013 nop
10054: 02813403 ld s0,40(sp)
10058: 03010113 addi sp,sp,48
1005c: 00008067 ret
0000000000010060 <main>:
10060: fe010113 addi sp,sp,-32
10064: 00113c23 sd ra,24(sp)
10068: 00813823 sd s0,16(sp)
1006c: 02010413 addi s0,sp,32
10070: 00500793 li a5,5
10074: fef42623 sw a5,-20(s0)
10078: 00600793 li a5,6
1007c: fef42423 sw a5,-24(s0)
10080: fe042223 sw zero,-28(s0)
10084: fe440793 addi a5,s0,-28
10088: 00078613 mv a2,a5
1008c: fe842583 lw a1,-24(s0)
10090: fec42503 lw a0,-20(s0)
10094: f6dff0ef jal 10000 <some_func>
10098: 00000013 nop
1009c: 01813083 ld ra,24(sp)
100a0: 01013403 ld s0,16(sp)
100a4: 02010113 addi sp,sp,32
100a8: 00008067 ret
The important steps that need clarification are (prologue of main()):
10060: fe010113 addi sp,sp,-32
10064: 00113c23 sd ra,24(sp)
10068: 00813823 sd s0,16(sp)
1006c: 02010413 addi s0,sp,32
and (prologue of some_func(int, int, int*)):
10000: fd010113 addi sp,sp,-48
10004: 02813423 sd s0,40(sp)
10008: 03010413 addi s0,sp,48
From my understanding, the stack pointer is moved to make space for the return address and parameters. (main might be a special case here.) How are the passed arguments treated when they are located on the stack? How are they read back? In general, the methodology is clear to me, but how would I hand-code this segment so that it works?
Regarding the related topic, the stack should look somewhat like
| ??? |
| params for some_func() <???> |
| ra of some_func() |
| locals of main() <int c> |
| locals of main() <int b> |
| locals of main() <int a> |
| params for main() <None> |
But that is pretty much it. Can anybody point out how this is arranged, and how these two listings (the function call) are related?

What you want to know is specified by the RISC-V calling conventions.
Main points:
Function arguments are usually passed in the a0 to a7 registers, not on the stack. An argument is only passed via the stack if there is no room left in the a* registers.
Some registers are caller-saved while others are callee-saved (cf. Table 26.1, Chapter 26 RISC-V Assembly Programmer’s Handbook, in the RISC-V base specification, 2019-06-08 ratified). That means that before calling a function, the caller has to save all caller-saved registers to the stack if it wants to retain their content. Similarly, the called function has to save all callee-saved registers to the stack if it wants to use them for its own purposes.
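To make the first point concrete, here is a minimal C sketch with more integer arguments than there are a* registers. The function sum10 and the helper demo_call are hypothetical names, not from the question. Under the standard RV64 integer calling convention, the first eight arguments should arrive in a0 to a7 and the remaining two should be placed on the stack by the caller; the comments record that expectation, while the C code itself behaves the same on any target.

```c
#include <assert.h>

/* Hypothetical function with ten integer arguments.
 * Expected placement under the standard RISC-V integer calling convention:
 *   a..h -> registers a0..a7
 *   i, j -> caller's stack (found at 0(sp) and 8(sp) on entry on RV64)
 * The result is returned in a0. */
long sum10(long a, long b, long c, long d, long e,
           long f, long g, long h, long i, long j)
{
    return a + b + c + d + e + f + g + h + i + j;
}

/* Small self-check: 1 + 2 + ... + 10 = 55. */
void demo_call(void)
{
    assert(sum10(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) == 55);
}
```

Compiling this with the same toolchain as in the question and disassembling it should show the caller storing the last two arguments relative to sp before the jal, which is the only case where arguments appear on the stack.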

The first few parameters, type permitting, are passed in registers, so they don't even appear on the stack. Other than that, it's unclear what you really want to know. If you do get some arguments that are on the stack, they stay there even after you adjust the stack pointer, so you can still address them relative to the adjusted stack pointer or a frame pointer (here s0, apparently).
The important steps, that need clarification are:
10060: fe010113 addi sp,sp,-32 # allocate space
10064: 00113c23 sd ra,24(sp) # save ra
10068: 00813823 sd s0,16(sp) # save s0
1006c: 02010413 addi s0,sp,32 # set up s0 as frame pointer
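The same reasoning applies to the prologue of some_func. As a sketch, the following C program tabulates the stack slots that the unoptimized some_func uses, with every offset read directly off the disassembly in the question. The struct and function names are made up for illustration. It also converts each s0-relative offset to the equivalent sp-relative one, which is what you need when hand-coding without a frame pointer.

```c
#include <assert.h>
#include <stdio.h>

/* One stack slot of some_func, as read off the disassembly in the question. */
struct slot {
    const char *what;  /* what the compiler keeps there */
    long s0_off;       /* offset relative to s0 (the frame pointer) */
    int size;          /* size in bytes */
};

enum { FRAME_SIZE = 48 };  /* from "addi sp,sp,-48" */

static const struct slot some_func_frame[] = {
    { "saved s0 of the caller", -8, 8 },  /* sd s0,40(sp)  */
    { "local cnt",             -20, 4 },  /* sw a5,-20(s0) */
    { "copy of a (was in a0)", -36, 4 },  /* sw a0,-36(s0) */
    { "copy of b (was in a1)", -40, 4 },  /* sw a1,-40(s0) */
    { "copy of c (was in a2)", -48, 8 },  /* sd a2,-48(s0) */
};

/* After "addi s0,sp,48" we have s0 = sp + 48,
 * so X(s0) names the same address as (X + 48)(sp). */
long sp_offset(long s0_off)
{
    return s0_off + FRAME_SIZE;
}

void print_frame(void)
{
    size_t n = sizeof some_func_frame / sizeof some_func_frame[0];
    for (size_t i = 0; i < n; i++) {
        const struct slot *s = &some_func_frame[i];
        long off = sp_offset(s->s0_off);
        /* Every slot must lie inside the 48-byte frame. */
        assert(off >= 0 && off + s->size <= FRAME_SIZE);
        printf("%-24s %4ld(s0) = %2ld(sp), %d bytes\n",
               s->what, s->s0_off, off, s->size);
    }
}
```

Note that some_func has no slot for ra: it calls nothing, so the compiler leaves ra in the register. The arguments only end up in the frame because the code was compiled without optimization; they were passed in a0 to a2.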

Related

What %lo(source)($6) and .frame mean in assembly code?

I compiled a simple C program to MIPS assembly and am trying to understand the result. By comparing it with the C code, I understand most of it, but I still have some questions.
I use mips-gcc to generate assembly code: $ mips-gcc -S -O2 -fno-delayed-branch -I/usr/include lab3_ex3.c -o lab3_ex3.s
Here is my guess about how the assembly code works:
main is the entry of the program.
$6 is the address of source array.
$7 is the address of dest array.
$3 is the size of source array.
$2 is the variable k and is initialized to 0.
$L3 is the loop
$5 and $4 are addresses of source[k] and dest[k].
sw $3,0($5) is equivalent to store source[k] in $3.
lw $3,4($4) is equivalent to assign source[k] to dest[k].
addiu $2,$2,4 is equivalent to k++.
bne $3, $0, $L3 means: if source[k] is zero, exit the loop; otherwise jump to label $L3.
$L2 just does some cleanup work.
Set $2 to zero.
Jump to $31 (return address).
My questions are:
What does .frame $sp,0,$31 do?
Why lw $3,4($4) instead of lw $3,0($4)?
What does the notation %lo(source)($6) mean? (The $hi and $lo registers are used in multiplication, so why are they used here?)
Thanks.
C
int source[] = {3, 1, 4, 1, 5, 9, 0};
int dest[10];
int main ( ) {
int k;
for (k=0; source[k]!=0; k++) {
dest[k] = source[k];
}
return 0;
}
Assembly
.file 1 "lab3_ex3.c"
.section .mdebug.eabi32
.previous
.section .gcc_compiled_long32
.previous
.gnu_attribute 4, 1
.text
.align 2
.globl main
.set nomips16
.ent main
.type main, #function
main:
.frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0
.mask 0x00000000,0
.fmask 0x00000000,0
lui $6,%hi(source)
lw $3,%lo(source)($6)
beq $3,$0,$L2
lui $7,%hi(dest)
addiu $7,$7,%lo(dest)
addiu $6,$6,%lo(source)
move $2,$0
$L3:
addu $5,$7,$2
addu $4,$6,$2
sw $3,0($5)
lw $3,4($4)
addiu $2,$2,4
bne $3,$0,$L3
$L2:
move $2,$0
j $31
.end main
.size main, .-main
.globl source
.data
.align 2
.type source, #object
.size source, 28
source:
.word 3
.word 1
.word 4
.word 1
.word 5
.word 9
.word 0
.comm dest,40,4
.ident "GCC: (GNU) 4.4.1"

Firstly, main, $L3 and $L2 are labels for 3 basic blocks. You are roughly correct about their functions.
Question 1: What is .frame doing
This is not a MIPS instruction. It is metadata describing the (stack) frame for this function:
the frame register: the stack is pointed to by $sp, an alias for $29;
the size of the stack frame: 0, since the function has neither local variables nor arguments on the stack. Further, the function is simple enough that it can work with scratch registers and does not need to save the callee-saved registers $16-$23;
the register holding the return address: $31 in the MIPS calling convention.
For more information regarding the MIPS calling convention, see this doc.
Question 2: Why lw $3,4($4) instead of lw $3,0($4)
This is due to an optimization of the loop. Normally, the sequence of loads and stores would be :
load source[0]
store dest[0]
load source[1]
store dest[1]
....
You assume that the loop is entirely in $L3, and that contains load source[k] and store dest[k]. It isn't. There are two clues to see this:
There is a load in the block main, although the C source performs no load outside the loop.
Within the basic block $L3, the store is before the load.
In fact, load source[0] is performed in the basic-block named main. Then, the loop in the basic block $L3 is store dest[k];load source[k+1];. Therefore, the load uses an offset of 4 more than the offset of the store, because it is loading the integer for the next iteration.
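Written back in C, the rotated loop described above looks roughly like the sketch below. The function name copy_rotated and the helper demo_copy are made up; the comments map each statement to the instruction from the question's listing that it corresponds to.

```c
#include <assert.h>

/* Copies source into dest up to (not including) the terminating 0,
 * using the same loop shape as the compiled code. Returns the count. */
int copy_rotated(const int *source, int *dest)
{
    int k = 0;
    int v = source[0];      /* lw  $3,%lo(source)($6)  (in block main) */
    if (v == 0)             /* beq $3,$0,$L2                           */
        return 0;
    do {
        dest[k] = v;        /* sw  $3,0($5)   store for this iteration */
        v = source[k + 1];  /* lw  $3,4($4)   load for the NEXT one    */
        k++;                /* addiu $2,$2,4  ($2 is a byte offset)    */
    } while (v != 0);       /* bne $3,$0,$L3                           */
    return k;
}

/* Self-check with the array from the question. */
void demo_copy(void)
{
    int source[] = {3, 1, 4, 1, 5, 9, 0};
    int dest[10] = {0};
    assert(copy_rotated(source, dest) == 6);
    assert(dest[0] == 3 && dest[5] == 9 && dest[6] == 0);
}
```

The source[k + 1] in the loop body is where the extra offset of 4 bytes in lw $3,4($4) comes from.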
Question 3: What is the lo/hi syntax?
This has to do with instruction encodings and pointers. Let us assume a 32-bit architecture, i.e. a pointer is 32 bits. Like most fixed-size instruction ISAs, let us assume that the instruction size is also 32 bits.
Before loading and storing from the source/dest arrays, you need to load their pointers into registers $6 and $7 respectively. Therefore, you need an instruction to load a 32-bit constant address into a register. However, a 32-bit instruction must contain a few bits to encode opcodes (which operation the instruction is), destination register etc. Therefore, an instruction has less than 32 bits left to encode constants (called immediates). Therefore, you need two instructions to load a 32-bit constant into a register, each loading 16 bits. The lo/hi refer to which half of the constant is loaded.
Example: Assume that dest is at address 0xabcd1234. There are two instructions to load this value into $7.
lui $7,%hi(dest)
addiu $7,$7,%lo(dest)
lui is Load Upper immediate. It loads the top 16 bits of the address of dest (0xabcd) into the top 16 bits of $7. Now, the value of $7 is 0xabcd0000.
addiu is Add Immediate Unsigned. It adds the lower 16 bits of the address of dest (0x1234) with the existing value in $7 to get the new value of $7. Thus, $7 now holds 0xabcd0000 + 0x1234 = 0xabcd1234, the address of dest.
Similarly, lw $3,%lo(source)($6) loads from the address pointed to by $6 (which already holds the top 16 bits of the address of source) at an offset of %lo(source) (the bottom 16 bits of that address). Effectively, it loads the first word of source.
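The split can be sketched in C as follows. One detail is glossed over above: the 16-bit immediate of addiu and lw is sign-extended, so, as far as I know, the assembler's %hi adds a carry whenever bit 15 of the address is set. The function names hi16, lo16 and rebuild are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Upper half as used by lui; the +0x8000 compensates for the
 * sign extension of the lower half. */
uint32_t hi16(uint32_t addr)
{
    return (addr + 0x8000u) >> 16;
}

/* Lower half, interpreted as a signed 16-bit immediate. */
int32_t lo16(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xffffu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* What "lui r,%hi(x); addiu r,r,%lo(x)" computes. */
uint32_t rebuild(uint32_t addr)
{
    return (hi16(addr) << 16) + (uint32_t)lo16(addr);
}

/* Self-check with the address used in the example above. */
void demo_hi_lo(void)
{
    assert(hi16(0xabcd1234u) == 0xabcdu);
    assert(lo16(0xabcd1234u) == 0x1234);
    assert(rebuild(0xabcd1234u) == 0xabcd1234u);
}
```

For 0xabcd1234 the adjustment is invisible because bit 15 of the low half is clear. For an address such as 0x1234abcd, hi16 gives 0x1235 and lo16 gives a negative value, and the sum is still the original address.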

Finding factorial of a number using recursive call in MIPS programming

This is the C source code
#include <stdio.h>
int main() {
printf("The Factorial of 10 is %d\n", fact(10));
}
int fact(int n) {
if (n < 1)
return (1);
else
return (n * fact(n - 1));
}
I am converting this C program to MIPS assembly, but when I run the MIPS program I get an error for the .ascii section.
.text
.globl main
main:
subu $sp,$sp,32 # Stack frame is 32 bytes long
sw $ra,20($sp) # Save return address
sw $fp,16($sp) # Save old frame pointer
addiu $fp,$sp,28 # Set up frame pointer
li $a0,10 # Put argument (10) in $a0
jal fact # Call factorial function
la $a0,$LC # Put format string in $a0
move $a1,$v0 # Move fact result to $a1
jal printf # Call the print function
lw $ra,20($sp) # Restore return address
lw $fp,16($sp) # Restore frame pointer
addiu $sp,$sp,32 # Pop stack frame
jr $ra # Return to caller
.rdata
$LC:
.ascii “The factorial of 10 is %d\n\000”
.text
fact:
subu $sp,$sp,32 # Stack frame is 32 bytes long
sw $ra,20($sp) # Save return address
sw $fp,16($sp) # Save frame pointer
addiu $fp,$sp,28 # Set up frame pointer
sw $a0,0($fp) # Save argument (n) to use for Recursive Call
lw $v0,0($fp) # Load n
bgtz $v0,$L2 # Branch if n > 0
li $v0,1 # Return 1
jr $L1 # Jump to code to return
$L2:
lw $v1,0($fp) # Load n
subu $v0,$v1,1 # Compute n - 1
move $a0,$v0 # Move value to $a0
jal fact # Call factorial function
lw $v1,0($fp) # Load n
mul $v0,$v0,$v1 # Compute fact(n-1) * n
$L1: # Result is in $v0
lw $ra, 20($sp) # Restore $ra
lw $fp, 16($sp) # Restore $fp
addiu $sp, $sp, 32 # Pop stack
jr $ra # Return to caller
It's giving me an error for the .ascii code section saying it shouldn't be in the .text:
Error in ".ascii" directive cannot appear in text segment
It's also saying that:
"$L1": operand is of incorrect type

It's giving me an error for the .ascii code section saying it shouldn't be in the .text:
Error in ".ascii" directive cannot appear in text segment
I am going out on a limb here because I am not 100% sure what you are running this on, but some simulators like MARS don't recognize the .rdata segment. You can try using just .data.
Also, if you are on something like WinMIPS64, you may want to try placing the .data segment at the top of the code. What you are doing is right in some environments but doesn't work in others, so give it a whirl.
May I suggest you try these things separately, just in case.
As for the second error, jr expects a register operand, so jr $L1 is rejected; to jump to a label, use j $L1 (or b $L1) instead.

How do I transfer control after a loop in assembly?

I would like to translate the C code below into assembly language.
However, I do not see that I need to use the stack in this example.
Moreover, I'd like to know whether or not "beq" saves the address of the following instruction in $ra like "jal" does, for when the loop ends, I would like to get back to the original function foo, and continue the instructions (which here is simply returning.)
int foo(int* a, int N) {
if(N > 0)
{
for(int i = 0; i != N; i = i + 1)
{
a[i] = bar(i << 4, a[i]);
}
}
return N & 7;
}
# assume a in $a0, N in $a1
foo:
slt $t0, $zero, $a1 #put 1 in $t0 if 0 < N
li $t1,0 # use $t1 as loop counter
beq $t0, 1, loop # enter loop if 0 < N
and $v0, $a1, 7 # do bitwise and on N and 7 and save in $v0 as return value
loop:
beq $t1, $a1, exit # exit loop when i = N
sll $t3, $t1, 2 # obtain 4 * i
add $t3, $a1, $t3 # obtain address of a[i] which is address of a plus 4i
lw $t3, o($t3) # load a[i] into $t3
sll $t4, $t1, 4 #perform i<< 4 and save in $t4
# the 2 previous load arguments for bar
jal bar # assume bar saves return value in $v2
sw $t3, 0($v1)
j loop
exit:
and $v0, $a1, 7

beq is for conditional branching, not calling — it changes the PC (conditionally) but not $ra. We use it to translate structured statements (e.g. if, for) into the if-goto style of assembly language.
However, I do not see that I need to use the stack in this example.
You must use the stack for this code, because the call to bar (as in jal bar) will wipe out foo's $ra. While bar will be able to return to foo, foo will not be able to return to its caller. Since this requires a stack, you will need a prologue and an epilogue to allocate and release some stack space.
Your code is not properly passing parameters to bar: i << 4, for example, should be passed in $a0, while a[i] should be passed in $a1.
You do not have a return instruction in foo — it is missing a jr $ra.

If either of your beq instructions did set $ra, those wouldn't be useful points to return back to. But since you asked:
I'd like to know whether or not "beq" saves the address of the following instruction in $ra like "jal" does
If the instruction mnemonic doesn't end with al (which stands for And Link), it doesn't save a return address in $ra.
Classic MIPS has the following instructions that link (taken from a somewhat incomplete reference, which is missing nor and I don't know what else):
jal target (Jump And Link)
BGEZAL $reg, target (conditional Branch if >= 0 And Link)
BLTZAL $reg, target (conditional Branch if < 0 And Link)
Note that the conditional branches are effectively branching on the sign bit of the register.
bal is an alias for bgezal $zero, target, useful for doing a PC-relative function call. (MIPS branches use a fully relative encoding for branch displacement, MIPS jumps use a region-absolute encoding that replaces the low 28 bits of PC+4. This matters for position-independent code).
None of this is particularly relevant to your case; your foo needs to save/restore $ra on entry/before jr $ra because you need to call bar with a jal or bal. Using a linking branch as the loop branch wouldn't affect anything (except to make your code even less efficient, and make performance worse on real CPUs that do return-address prediction with a special predictor that assumes jal and jr $ra are paired properly).
Using bal / jal doesn't automatically make the thing you jump to ever return; that only happens if the target ever uses jr $ra (potentially after copying $ra somewhere else then restoring it).

C Programming to MIPS Assembly (for Loops)

I'm trying to convert this C code to MIPS assembly and I am unsure whether it is correct. Can someone help me, please?
Question : Assume that the values of a, b, i, and j are in registers $s0, $s1, $t0, and $t1, respectively. Also, assume that register $s2 holds the base address of the array D
C Code :
for(i=0; i<a; i++)
for(j=0; j<b; j++)
D[4*j] = i + j;
My Attempt at MIPS ASSEMBLY
add $t0, $t0, $zero # i = 0
add $t1, $t1, $zero # j = 0
L1 : slt $t2, $t0, $s0 # i<a
beq $t2, $zero, EXIT # if $t2 == 0, Exit
add $t1, $zero, $zero # j=0
addi $t0, $t0, 1 # i ++
L2 : slt $t3, $t1, $s1 # j<b
beq $t3, $zero, L1, # if $t3 == 0, goto L1
add $t4, $t0, $t1 # $t4 = i+j
muli $t5, $t1, 4 # $t5 = $t1 * 4
sll $t5, $t5, 2 # $t5 << 2
add $t5, $t5, $s2 # D + $t5
sw $t4, $t5($s2) # store word $t4 in addr $t5(D)
addi $t0, $t1, 1 # j ++
j L2 # goto L2
EXIT :

add $t0, $t0, $zero # i = 0
Nope, that leaves $t0 unmodified, holding whatever garbage it held before. Perhaps you meant to use addi $t0, $zero, 0?
Also, MIPS doesn't have 2-register addressing modes (for integer load/store), only 16-bit-constant ($reg). $t5($s2) isn't legal. You need a separate addu instruction, or better a pointer-increment.
(You should use addu instead of add for pointer math; it's not an error if address calculation crosses from the low half to high half of address space.)
In C, it's undefined behaviour for another thread to be reading an object while you're writing it, so we can optimize away the actual looping of the outer loop. Unless the type of D is _Atomic int *D or volatile int *D, but that isn't specified in the question.
The inner loop writes the same elements every time regardless of the outer loop counter, so we can optimize away the outer loop and only do the final outer iteration, with i = a-1. Unless a <= 0, then we must skip the outer loop body, i.e. do nothing.
Optimizing away all but the last store to every location is called "dead store elimination". The stores in earlier outer-loop iterations are "dead" because they're overwritten with nothing reading their value.
You normally want to put the loop condition at the bottom of the loop, so the loop branch is a bne $t0, $t1, top_of_loop for example. (MIPS has bne as a native hardware instruction; blt is only a pseudo-instruction unless the 2nd register is $zero.) So we want to optimize j<b to j!=b because we know we're counting upward.
Put a conditional branch before the loop to check if it might need to run zero times, e.g. blez $s1, after_loop to skip the inner loop entirely if b <= 0.
An idiomatic for(i=0 ; i<a ; i++) loop in asm looks like this in C (or some variation on this).
if(a<=0) goto end_of_loop;
int i=0;
do{ ... }while(++i != a);
Or if i isn't used inside the loop, then i=a and do{}while(--i). (i.e. add -1 and use bnez). Although MIPS can branch just as efficiently on i!=a as it can on i!=0, unlike most architectures with a FLAGS register where counting down saves a compare instruction.
D[4*j] means we stride by 16 bytes in a word array. Separately using a multiply by 4 and a shift by 2 is crazy redundant. Just keep a pointer in a separate register and increment it by 16 every iteration, like a C compiler would.
We don't know the type of D, or any of the other variables for that matter. If any of them are narrow unsigned integers, we might need to implement 8 or 16-bit truncation/wrapping.
But your implementation assumes they're all int or unsigned, so let's do that.
I'm assuming a MIPS without branch-delay slots, like MARS simulates by default.
i+j starts out (with j=0) as a-1 on the last outer-loop iteration that sets the final value. It runs up to j=b-1, so the max value is a-1 + b-1.
Simplifying the problem down to the values we need to store, and the locations we need to store them in, before writing any asm, means the asm we do write is a lot simpler and easier to debug.
You could check the validity of most of these transformations by doing them in C source and checking with a unit test in C.
# int a: $s0
# int b: $s1
# int *D: $s2
# Pointer to D[4*j] : $t0
# int i+j : $t1
# int a-1 + b : $t2 loop bound
blez $s0, EXIT # if(a<=0) goto EXIT
blez $s1, EXIT # if(b<=0) goto EXIT
# now we know both a and b loops run at least once so there's work to do
addiu $t1, $s0, -1 # tmp = a-1 // addiu rather than addi because the C source doesn't do this operation, so we must not fault on signed overflow here. Although that's impossible because we already excluded a <= 0
addu $t2, $t1, $s1 # tmp_end = a-1 + b // one past the max we store
add $t0, $s2, $zero # p = D // copy to avoid destroying the D pointer; otherwise increment $s2 directly
inner: # do {
sw $t1, ($t0) # *p = tmp (the current i+j)
addiu $t1, $t1, 1 # tmp++;
addiu $t0, $t0, 16 # p += 4, i.e. 4*sizeof(*D) bytes // could go in the branch-delay slot
bne $t1, $t2, inner # }while(tmp != tmp_end)
EXIT:
We could have done the increment first, before the store, and used a-2 and a+b-2 as the initializer for tmp and tmp_end. On some real pipelined/superscalar MIPS CPUs, that might be better to avoid putting the increment right before the bne that reads it. (After moving the pointer-increment into the branch-delay slot). Of course you'd actually unroll to save work, e.g. using sw $t1, 16($t0) and 32($t0) / 48($t0).
Again on a real MIPS with branch delays, you'd move some of the init of $t0..2 to fill the branch delay slots from the early-out blez instructions, because they couldn't be adjacent.
So as you can see, your version was over-complicated to say the least. Nothing in the question said we have to transliterate each C expression to asm separately, and the whole point of C is the "as-if" rule that allows optimizations like this.
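As suggested above, the transformation can be checked in C before writing any asm. The following is a minimal sketch of such a unit test, assuming D is an array of int. The names reference, optimized, same_result, check_all and D_LEN are made up for this example; optimized mirrors the asm loop, with tmp playing the role of $t1 and p the role of $t0.

```c
#include <assert.h>
#include <string.h>

enum { D_LEN = 32 };  /* large enough for b <= 8, since 4*(8-1) = 28 */

/* The loop nest exactly as given in the question. */
void reference(int *D, int a, int b)
{
    for (int i = 0; i < a; i++)
        for (int j = 0; j < b; j++)
            D[4 * j] = i + j;
}

/* Only the stores of the last outer iteration (i = a-1) survive. */
void optimized(int *D, int a, int b)
{
    if (a <= 0 || b <= 0)
        return;
    int tmp = a - 1;          /* i + j with j = 0      */
    int tmp_end = a - 1 + b;  /* one past the last one */
    int *p = D;
    do {
        *p = tmp;
        tmp++;
        p += 4;               /* 16 bytes for 4-byte int */
    } while (tmp != tmp_end);
}

/* Runs both versions on sentinel-filled arrays and compares them. */
int same_result(int a, int b)
{
    int d1[D_LEN], d2[D_LEN];
    for (int k = 0; k < D_LEN; k++)
        d1[k] = d2[k] = -1;
    reference(d1, a, b);
    optimized(d2, a, b);
    return memcmp(d1, d2, sizeof d1) == 0;
}

/* Exhaustive check over a small range, including a <= 0 and b <= 0. */
void check_all(void)
{
    for (int a = -1; a <= 5; a++)
        for (int b = -1; b <= 8; b++)
            assert(same_result(a, b));
}
```

Filling both arrays with a sentinel first also catches stray writes to elements that the original loop never touches.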

This similar C code compiles and translates to MIPS:
#include <stdio.h>
main()
{
int a,b,i,j=5;
int D[50];
for(i=0; i<a; i++)
for(j=0; j<b; j++)
D[4*j] = i + j;
}
Result:
.file 1 "Ccode.c"
# -G value = 8, Cpu = 3000, ISA = 1
# GNU C version cygnus-2.7.2-970404 (mips-mips-ecoff) compiled by GNU C version cygnus-2.7.2-970404.
# options passed: -msoft-float
# options enabled: -fpeephole -ffunction-cse -fkeep-static-consts
# -fpcc-struct-return -fcommon -fverbose-asm -fgnu-linker -msoft-float
# -meb -mcpu=3000
gcc2_compiled.:
__gnu_compiled_c:
.text
.align 2
.globl main
.ent main
main:
.frame $fp,240,$31 # vars= 216, regs= 2/0, args= 16, extra= 0
.mask 0xc0000000,-4
.fmask 0x00000000,0
subu $sp,$sp,240
sw $31,236($sp)
sw $fp,232($sp)
move $fp,$sp
jal __main
li $2,5 # 0x00000005
sw $2,28($fp)
sw $0,24($fp)
$L2:
lw $2,24($fp)
lw $3,16($fp)
slt $2,$2,$3
bne $2,$0,$L5
j $L3
$L5:
.set noreorder
nop
.set reorder
sw $0,28($fp)
$L6:
lw $2,28($fp)
lw $3,20($fp)
slt $2,$2,$3
bne $2,$0,$L9
j $L4
$L9:
lw $2,28($fp)
move $3,$2
sll $2,$3,4
addu $4,$fp,16
addu $3,$2,$4
addu $2,$3,16
lw $3,24($fp)
lw $4,28($fp)
addu $3,$3,$4
sw $3,0($2)
$L8:
lw $2,28($fp)
addu $3,$2,1
sw $3,28($fp)
j $L6
$L7:
$L4:
lw $2,24($fp)
addu $3,$2,1
sw $3,24($fp)
j $L2
$L3:
$L1:
move $sp,$fp # sp not trusted here
lw $31,236($sp)
lw $fp,232($sp)
addu $sp,$sp,240
j $31
.end main

Crosscompiling C to MIPS64 and simulating

I needed to translate the following C code to MIPS64:
#include <stdio.h>
int main() {
int x;
for (x=0;x<10;x++) {
}
return 0;
}
I used codebench to crosscompile this code to MIPS64. The following code was created:
.file 1 "loop.c"
.section .mdebug.abi32
.previous
.gnu_attribute 4, 1
.abicalls
.option pic0
.text
.align 2
.globl main
.set nomips16
.set nomicromips
.ent main
.type main, #function
main:
.frame $fp,24,$31 # vars= 8, regs= 1/0, args= 0, gp= 8
.mask 0x40000000,-4
.fmask 0x00000000,0
.set noreorder
.set nomacro
addiu $sp,$sp,-24
sw $fp,20($sp)
move $fp,$sp
sw $0,8($fp)
j $L2
nop
$L3:
lw $2,8($fp)
addiu $2,$2,1
sw $2,8($fp)
$L2:
lw $2,8($fp)
slt $2,$2,10
bne $2,$0,$L3
nop
move $2,$0
move $sp,$fp
lw $fp,20($sp)
addiu $sp,$sp,24
j $31
nop
.set macro
.set reorder
.end main
.size main, .-main
.ident "GCC: (Sourcery CodeBench 2012.03-81) 4.6.3"
To check if the code works as expected, I usually use the WINMIPS64 simulator. For some reason this simulator does not accept this code; it appears that every line is wrong. I have been looking at this issue for over a day and hope someone can help me out. What is wrong with this assembly code for the MIPS64 architecture?

From page 7 of the WINMIPS64 documentation:
The following assembler directives are supported
.data - start of data segment
.text - start of code segment
.code - start of code segment (same as .text)
.org <n> - start address
.space <n> - leave n empty bytes
.asciiz <s> - enters zero terminated ascii string
.ascii <s> - enter ascii string
.align <n> - align to n-byte boundary
.word <n1>,<n2>.. - enters word(s) of data (64-bits)
.byte <n1>,<n2>.. - enter bytes
.word32 <n1>,<n2>.. - enters 32 bit number(s)
.word16 <n1>,<n2>.. - enters 16 bit number(s)
.double <n1>,<n2>.. - enters floating-point number(s)
Get rid of everything that's not in the above list, as it won't run in the simulator.
You'll need to move the .align to before .text
WINMIPS64 expects daddi/daddui instead of addi/addiu, again as per the documentation.
As per the documentation, move $a, $b is not a supported mnemonic. Replace them with daddui $a, $b, 0 instead.
slt needs to be slti.
Finally, the simulator expects an absolute address for j, but you've given it a register. Use jr instead.
At this point I get an infinite loop. This is because the stack pointer doesn't get initialized. The simulator only gives you 0x400 bytes of memory, so go ahead and initialize the stack to 0x400:
.text
daddui $sp,$0,0x400
Now it runs. Since you're running the code by itself, nothing will be in the return register and the final jr $31 will just bring it back to the beginning.
Here's my version:
.align 2
.text
daddui $sp,$0,0x400
main:
daddui $sp,$sp,-24
sw $fp,20($sp)
daddui $fp,$sp,0
sw $0,8($fp)
j $L2
nop
$L3:
lw $2,8($fp)
daddui $2,$2,1
sw $2,8($fp)
$L2:
lw $2,8($fp)
slti $2,$2,10
bne $2,$0,$L3
nop
daddui $2,$0,0
daddui $sp,$fp,0
lw $fp,20($sp)
daddui $sp,$sp,24
jr $31
nop
Consider getting either another compiler or another simulator, because these two clearly hate each other.