I am currently trying to optimize some MIPS assembler that I've written for a program that triangulates a 24x24 matrix. My current goal is to utilize delayed branching and manual loop unrolling to try and cut down on the cycles. Note: I am using 32-bit single precision for all the matrix arithmetic.
Part of the algorithm involves the following loop that I'm trying to unroll (N will always be 24):
...
float inv = 1/A[k][k]
for (j = k + 1; j < N; j++) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
}
...
I want to unroll it like this:
...
float inv = 1/A[k][k]
for (j = k + 1; j < N; j +=2) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
A[k][j + 1] = A[k][j + 1] * inv;
}
...
but it produces an incorrect result and I don't know why. Interestingly, the unrolled version generates the first row of the matrix correctly but the remaining rows incorrectly. The version without loop unrolling triangulates the matrix correctly.
Here is my attempt at doing it.
...
# No loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 1 ## j += 1
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f0, 0($v0) ## Perform A[k][j] := A[k][j] * inv
# The matrix triangulates without problem with this original code.
...
...
# One loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 2 ## j += 2
lwc1 $f1, 4($v0) # $f1 <- A[k][j + 1]
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
mul.s $f1, $f1, $f2 # Perform A[k][j+1] * inv
swc1 $f0, 0($v0) # Perform A[k][j] := A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f1, 4($v0) ## Perform A[k][j + 1] := A[k][j + 1] * inv
# The first row in the resulting matrix is correct, but the remaining ones are not when using this once-unrolled loop code.
...
The unrolled C loop condition is buggy.
j < N; j +=2 can start the loop body with j = N-1,
accessing the array at A[k][N-1] (fine) and A[k][N] (not fine).
One common method is j < N-1, or in general j < N-(unroll-1). But for unsigned N, you also have to separately check N >= unroll before starting the loop, because N-1 could wrap to a huge unsigned value.
Keeping the j < limit form is generally good for C compilers, versus j + 1 < N, which is a separate thing they'd have to calculate. The j + 1 < N form can also stop a compiler from proving that the loop isn't infinite for unsigned counts (like size_t), because unsigned arithmetic is well-defined as wrapping around, so N = UINT_MAX could lead to the condition always being true depending on the starting point. (e.g. j = UINT_MAX-2 makes UINT_MAX-1 < UINT_MAX, and j+=2 makes 0 < UINT_MAX, also true.) So it's a similar problem to using j <= limit for unsigned counters. Compilers really like to know when a loop is potentially infinite. For some, it disables auto-vectorization if the trip-count isn't calculable ahead of the first iteration.
If j started at 0, you could get away with a sloppy condition as long as N was guaranteed to be a multiple of the unroll factor. But here j starts at k + 1, so the trip count can be odd, as Nate points out.
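As a minimal sketch of the fixed condition, the C below unrolls by 2 with a j < N - 1 test and a one-element cleanup. The function name scale_row_unrolled and the fixed N of 24 are assumptions for illustration, not code from the question. If the rows are stored contiguously, A[k][N] is the same location as A[k+1][0], which would be consistent with the first row coming out right and the later rows not.

```c
#include <stddef.h>

#define N 24  /* row length, as in the question */

/* Hypothetical helper: scale row[k+1 .. N-1] by inv, unrolled by 2. */
void scale_row_unrolled(float row[N], size_t k, float inv)
{
    size_t j = k + 1;

    /* Main unrolled loop: runs only while both j and j+1 are in range.
       N is a constant >= 2 here, so N - 1 cannot wrap around. */
    for (; j < N - 1; j += 2) {
        row[j]     = row[j]     * inv;
        row[j + 1] = row[j + 1] * inv;
    }

    /* Cleanup: at most one element is left when the trip count is odd. */
    if (j < N) {
        row[j] = row[j] * inv;
    }
}
```

With k = 0 there are 23 elements to scale, so the cleanup step handles the last one instead of the loop touching row[24].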
Efficiency of your MIPS asm
Generally the point of loop unrolling is performance. A non-inline call to a helper function inside the loop largely defeats the purpose.
I assume jal getelem does a bunch of multiplies and so on to redo the indexing from a pointer and two integers? Notice that you're scanning along contiguous memory in one row, so you can just increment a pointer.
Calculate an end-pointer to compare against, so your MIPS loop can look like:
# some checking outside the loop, maybe with a bxx to the end of it.
looptop: # do{
lwc1 $f2, 0($t0)
lwc1 $f3, 4($t0)
addiu $t0, $t0, 4*2 # p+=2 advance by 8 bytes, 2 floats
...
swc1 something, -8($t0) # $t0 was already advanced, so use negative offsets
swc1 something, -4($t0)
bne $t0, $t1, looptop # }while(p!=endp)
# maybe another condition to check if you should run one last iteration.
MIPS bltu is only a pseudo-instruction (sltu/bnez); that's why it's better to calculate an exact end-pointer so you can use a single machine instruction as the loop branch.
And yes, this might mean rounding the iteration count down to a multiple of 2 to ensure correctness. Or doing a scalar iteration and rounding up to a multiple of 2. e.g. x++ / x&=-2;
With software pipelining, e.g. doing a load and multiply but not a store yet, you could maybe let the rounding-up have the loop redo that element if odd. (If the chance of a branch mispredict costs more than an FP multiply and a redundant store.) Haven't fully thought this through, but it's a similar idea to SIMD doing a first unaligned vector, then a potentially-partially-overlapping aligned vector. (SIMD vectorization is like unrolling, but then you roll back up into a single instruction that does 4 elements, for example.)
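The pointer-increment idea can be sketched in C as follows. This is only an illustration under assumptions: scale_span is a hypothetical name, and it peels one scalar iteration when the count is odd, which is just one of the options mentioned above. The comments map each statement to the kind of MIPS instruction it would become.

```c
#include <stddef.h>

/* Hypothetical helper: multiply count floats starting at p by inv. */
void scale_span(float *p, size_t count, float inv)
{
    /* Peel one scalar iteration if the count is odd, so the rest is even. */
    if (count & 1) {
        *p = *p * inv;
        p++;
        count--;
    }
    if (count == 0) {
        return;                      /* nothing left: skip the do-while */
    }

    float *endp = p + count;         /* exact end pointer; count is even */
    do {
        float x0 = p[0];             /* lwc1 ..., 0(p)  */
        float x1 = p[1];             /* lwc1 ..., 4(p)  */
        p += 2;                      /* addiu p, p, 8   */
        p[-2] = x0 * inv;            /* swc1 ..., -8(p) */
        p[-1] = x1 * inv;            /* swc1 ..., -4(p) */
    } while (p != endp);             /* bne p, endp, looptop */
}
```

For row k of the matrix this would be called with a pointer to A[k][k+1] and a count of N - k - 1, so no per-element index arithmetic is needed inside the loop.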
I started to read MIPS to understand better how my C++ and C code works under the computer skin. I started with a recursive function, a Fibonacci function.
The C code is:
int fib(int n) {
if(n == 0) { return 0; }
if(n == 1) { return 1; }
return (fib(n - 1) + fib(n - 2));
}
MIPS code:
fib:
addi $sp, $sp, -12
sw $ra, 8($sp)
sw $s0, 4($sp)
add $v0, $zero, $zero
beq $a0, $zero, end
addiu $v0, $zero, 1
addiu $t0, $zero, 1
beq $a0, $t0, end
addiu $a0, $a0, -1
sw $a0, 0($sp)
jal fib #fib(n-1)
add $s0, $v0, $zero
lw $a0, 0($sp)
addiu $a0, $a0, -1
jal fib #fib(n-2)
add $v0, $v0, $s0
end:
lw $s0, 4($sp)
lw $ra, 8($sp)
addi $sp, $sp, 12
jr $ra
When n > 1, execution proceeds until it reaches the first jal instruction. What happens next? Does it return to the fib label, ignoring the code below (so the fib(n-2) call is never executed)? If so, the $sp pointer is decreased by 3 words again, and the cycle repeats until n <= 1. I can't understand how this works once the first jal instruction is reached.
Can you follow how the recursion works in C?
In some sense, recursion has two components: the forward part and the backward part. In the forward part, a recursive algorithm computes things before the recursion, and in the backward part, a recursive algorithm computes things after the recursion completes. In between the two parts, there is the recursion.
See this answer: https://stackoverflow.com/a/71551098/471129
Fibonacci is just slightly more complicated as it performs recursion twice, not just once as in the list-printing example in that answer.
However, the principles are the same: There is work done before the recursion, and work done after (either of which can be degenerate). The before part happens as code in front of the recursion executes, and the recursion builds up stack frames that are placeholders for work after the recursion yet to be completed. The after part happens as the stack frames are released and the code after the recursive call is executed.
In any given call chain, the forward part goes until n is 0 or 1, then the algorithm starts returning back to the stacked callers, for whom the backward part kicks in unwinding stack frames until it returns to the original caller (perhaps main) rather than to some recursive fib caller. Again, complicated by use of two recursive invocations rather than one as in simpler examples.
With fib, the work done before is to count down (by -1 or -2) until reaching 0 or 1. The work done after the recursion is to sum the two prior results. The recursion itself effectively suspends an invocation or activation of fib with current values, to be resumed when a recursive call completes.
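To make the forward and backward parts visible, here is a small instrumented sketch. The names fib_traced and fib_calls and the extra depth parameter are additions for illustration only; they are not part of the original function.

```c
#include <stdio.h>

int fib_calls = 0;   /* hypothetical counter: number of fib activations */

/* Same logic as fib, plus tracing. depth is only used for indentation. */
int fib_traced(int n, int depth)
{
    fib_calls++;
    printf("%*senter fib(%d)\n", depth * 2, "", n);    /* forward part */

    int result;
    if (n == 0) {
        result = 0;
    } else if (n == 1) {
        result = 1;
    } else {
        int fm1 = fib_traced(n - 1, depth + 1);  /* this activation is suspended here */
        int fm2 = fib_traced(n - 2, depth + 1);  /* and suspended again here */
        result = fm1 + fm2;                      /* backward part: sum the results */
    }

    printf("%*sleave fib(%d) = %d\n", depth * 2, "", n, result);
    return result;
}
```

Calling fib_traced(2, 0) enters fib(2) first and leaves it last, with the complete fib(1) and fib(0) activations nested in between. That nesting is what the saved $ra and the stack frame implement in the assembly version.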
Recursion in the MIPS version works the same way; however, operations that are implicit in C (saving the return address, preserving values across calls, adjusting the stack pointer) are spread out over several explicit machine instructions.
I suggest single-stepping through a call to fib(2) as a very small example that may help you see what's going on. First do this in C: single step until the outer fib call has fully completed and returned to the calling test function (e.g. main).
To make the C version just a bit easier to view in the debugger you might use this version:
int fib(int n) {
if (n == 0) { return 0; }
if (n == 1) { return 1; }
int fm1 = fib(n-1);
int fm2 = fib(n-2);
int result = fm1 + fm2;
return result;
}
With that equivalent C version, you'll be able to inspect fm1, fm2, and result during single stepping. That will make it easier to follow.
Next, do the same in the assembly version. Debug single step to watch execution of fib(2), and draw parallels with the equivalents in C.
There's another way to think about recursion, which is to ignore the recursion and pretend that the recursive call goes to some unrelated function implementation that just happens to yield the proper results of the recursive function. Here's such a non-recursive function:
int fib(int n) {
if (n == 0) { return 0; }
if (n == 1) { return 1; }
int fm1 = fibX(n-1); // calls something else that computes fib(n-1)
int fm2 = fibX(n-2); // "
int result = fm1 + fm2;
return result;
}
With this code, and the assumption that fibX simply works correctly to return proper results, you can focus strictly on the logic of one level, namely, the body of this fib, without considering the recursion at all.
Note that we can do the same in assembly language — though the opportunities for errors / typos are always much larger than in the C, since you still have to manipulate stack frames and preserve critical storage for later use after the calling.
The code you've posted has a transcription error, making it different from the C version. It is doing the C equivalent of:
return fib(n-1) + fib(n-1);
I am trying to create a program where I can store up to 8 values in an array and then compare all these values to find the smallest number. For some reason my loop overwrites the first position in the array every time. Here's what I have. I then add 4 to $t1 on the loop so once it goes back around it should store the next integer in the space after that. I don't see what I'm doing wrong here?
.data
myArray: .space 32
Msg1: .asciiz "Enter an integer: "
.text
main:
# Print Message
li $v0, 4
la $a0, Msg1
syscall
# Prompt the user to enter an integer
li $v0, 5
syscall
# Store the first integer in $t0
move $t0, $v0
# Declare $t1 for the array position that the integer will be stored at
addi $t1, $zero, 0
# Store the integer in the array
sw $t0, myArray($t1)
#Add 4 to $t1 so store the next value in the next array position
addi $t1, $zero, 4
beq $t0, $zero, Exit
j main
Exit:
# Declare an exit to the program
li $v0, 10
syscall
First, let's start with a working algorithm in C:
int a [] = { /* array elements */ };
int n = /* count of number of elements */;
...
int currMin = INT_MAX; // from <limits.h>; starting at 0 would miss the min of all-positive arrays
for ( int i = 0; i < n; i++ ) {
int next = a[i]; // next array element to check
if ( next < currMin ) // is it smaller than what we've seen so far?
currMin = next; // yes: capture new min value
}
// on exit from the loop currMin holds the min value
Ok, now we'll make some simple logical transformations on the way to taking this to assembly language. First, we remove the for loop in favor of the slightly simpler while loop construct.
int currMin = INT_MAX; // from <limits.h>
int i = 0;
while ( i < n ) {
int next = a[i]; // next array element to check
if ( next < currMin ) // is it smaller than what we've seen so far?
currMin = next; // yes: capture new min value
i++;
}
// on exit from the loop currMin holds the min value
Next, we'll transform the while loop into assembly's if-goto-label. (We could work the if-then first instead; the order we transform doesn't matter.)
int currMin = INT_MAX; // from <limits.h>
int i = 0;
loop1:
if ( i >= n ) goto endLoop1;
int next = a[i]; // next array element to check
if ( next < currMin ) // is it smaller than what we've seen so far?
currMin = next; // yes: capture new min value
i++;
goto loop1;
endLoop1:
// on exit from the loop currMin holds the min value
Next, we'll do the if-then statement. We could have done it first, that wouldn't change the analysis or results.
int currMin = INT_MAX; // from <limits.h>
int i = 0;
loop1:
if ( i >= n ) goto endLoop1;
int next = a[i]; // next array element to check
if ( next >= currMin ) goto endIf1;
currMin = next; // yes: capture new min value
endIf1:
i++;
goto loop1;
endLoop1:
Next we'll take this to assembly language.
First, assign variables to physical storage; here a good choice is registers. Mental map:
$a0 array a
$a1 element count n
$v0 currMin, the result
$t0 loop control variable i, also used as index
$t1 temporary variable "next"
Second, translate the last version of the code above:
li $v0, 0x7fffffff # currMin = INT_MAX, so any element can replace it
li $t0, 0 # i = 0
loop1:
bge $t0, $a1, endLoop1
# array reference, variable index: a[i], capture in "next"/$t1
sll $t9, $t0, 2
add $t9, $a0, $t9
lw $t1, 0($t9)
# the if-then inside the loop body
bge $t1, $v0, endIf1
move $v0, $t1 # capture newly seen lowest value
endIf1:
# finish the rest of the while loop, having the for-loop i++ here
addi $t0, $t0, 1
j loop1
endLoop1:
All that's left is to put some starting and ending code around that, assuming the register numbers match up.
The starting code for this would put the address of the array into $a0, and the count of elements into $a1. Could use different registers of course, with appropriate modifications.
The ending code should expect the result in $v0, to print or otherwise.
The starting code would be entirely before this code, and the ending code entirely after.
Yes, there are a few steps, but each one is a relatively simple and logical transformation. These transformations let you translate the C code in stages, first staying in C but simplifying it until it is easy to go straight to assembly:
1. Remove for loops in favor of while loops.
2. Change all control structures into if-goto-label.
   Each control structure (if, while) can be changed one at a time while staying in C, so keep checking that the code continues to work in C! The order of control structure transformations doesn't matter (inside out, outside in).
3. Translate simplified C code into assembly:
   a. Map logical variables of C into physical storage of machine code
   b. Convert statements & expressions from C into assembly language
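One way to keep checking that the code still works in C is to keep the structured version next to the if-goto-label version and compare their results. A minimal sketch follows; the function names min_for and min_goto are hypothetical, and it assumes currMin starts at INT_MAX so that arrays containing only positive values are handled.

```c
#include <limits.h>

/* Structured version: plain for loop. */
int min_for(const int *a, int n)
{
    int currMin = INT_MAX;
    for (int i = 0; i < n; i++) {
        if (a[i] < currMin)
            currMin = a[i];
    }
    return currMin;
}

/* Same logic after both transformations: only if-goto-label remains. */
int min_goto(const int *a, int n)
{
    int currMin = INT_MAX;
    int i = 0;
    int next;
loop1:
    if (i >= n) goto endLoop1;          /* exit test: opposite of i < n */
    next = a[i];
    if (next >= currMin) goto endIf1;   /* skip test: opposite of next < currMin */
    currMin = next;
endIf1:
    i++;
    goto loop1;
endLoop1:
    return currMin;
}
```

Running both on the same inputs after each transformation step catches a wrongly negated condition before any assembly is written.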
I am confused about how to convert C code to MIPS. I seem to get the loops confused and I think I am possibly using the wrong command. The C code I wrote is as follows:
int main()
{
int x, y;
int sum = 0;
printf("Please enter values for X and Y:\n ");
scanf("%d %d",&x,&y);
if (x > y)
{
printf("\n** Error");
exit(0);
}
while (x <= y)
{
if (x%2 == 0)
sum += x;
x++;
}
printf("\nThe sum of the even integers between X and Y is: %d\n\n",sum);
return 0;
}
My attempt at the MIPS translation is as follows:
.data
Prompt: .asciiz "Please enter values for X and Y:\n"
Result: .asciiz "The sum of the even integers between X and Y is: \n"
.text
li $v0,4 #load $v0 with the print_string code.
la $a0, Prompt #load $a0 with the message to be displayed
syscall
li $v0,5 #load $v0 with the read_int code for X
syscall
move $t0,$v0
li $v0,5 #load $v0 with the read_int code for Y
syscall
move $t1, $v0
while:
slt $t2, $t1,$t0 #$t1 = y $t0 = x
li $t3,2
div $t2,$t3
beq $t2,$0,else
add $s1,$s1,$t0 #s1 = s1 + x
addi $t0,$t0,1 #x++
j while
else:
li $v0,4
la $a0, Result
syscall
move $a0,$s1
li $v0,1
syscall
I think my error is in the loop in my MIPS code. My result keeps producing zero and I think my code is checking the loop and then just jumping to my else statement.
After further work, I got it to calculate the sum of all integers and I'm not exactly sure why it is doing so. Here is my most recent update:
while:
sle $t2, $t0,$t1 #$t1 = y $t0 = x
li $t3,2 #t3 = 2
div $t2,$t3 #$t2/2
beq $t2,$0, else #if ($t2/2 == 0), jump to the else, otherwise do else
add $s1,$s1,$t0 #s1 = s1 + x
addi $t0,$t0,1 #x++
j while
So now, if I enter 1 and 5, it calculates 1 and 3 and gives me 6 instead of just the even sum which should be just 2.
To answer my own question, the main confusion was with the branches. I now understand that they kind of work like opposites, so for example, I had to change the "beq" in my while loop to bnez so it would do the calculations when $t2 was != 0. Another minor fix was adding the increment outside of the loop. So, when $t2 != 0, I jump to my "else", which then increments to find the next number. However, if the remainder was 0, it did the math of sum = sum + x. In conclusion, the main confusion came from thinking the opposite way about the branches. I now understand that if I wanted to say:
while(a1 < a2)
I would have to write it as
while:
bgeu $a1,$a2, done
addi "whatever"
b while
done:
do done stuff
Before I understood this, I was writing it as ble $a1,$a2,done, and that is not the way it should be written. It reads like the C condition, but it is really saying: if a1 <= a2, jump to "done" and skip the calculations. So I just had to think in terms of the opposite condition.
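The "think opposite" rule can be checked in C first by rewriting the even-sum loop with only if-goto-label, so each test is already in the form a MIPS branch needs. This is a sketch; sum_even and the label names are made up for illustration. Note that in MIPS the remainder of div lands in the HI register and is read with mfhi, so the value tested by the second branch would come from there.

```c
/* Hypothetical helper mirroring the question's loop:
   sum of the even integers in the range [x, y]. */
int sum_even(int x, int y)
{
    int sum = 0;
while_top:
    if (x > y) goto done;          /* opposite of (x <= y): exit the loop       */
    if (x % 2 != 0) goto skip;     /* opposite of (x % 2 == 0): skip the add    */
    sum += x;
skip:
    x++;                           /* runs for even and odd x alike */
    goto while_top;
done:
    return sum;
}
```

For the inputs 1 and 5 this adds 2 and 4, giving 6.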
In class we were learning how to convert C code to MIPS instructions, but I ran into a small problem. I just wanted some clarification as to what exactly the last line of the MIPS instructions is saying.
c:
do{
i=i-2;
}while(i>1);
mips:
DO: addi s1,s1,-2 // i=i-2
addi t0,$zero,1 // t0 = 1
slt t1,t0,s1 // 1<i
bne t1,$zero,DO // ???
do{
i=i-2;
}while(i>1);
The assembly code for while and do while loops tests the opposite condition of the one given in the high-level code. If that opposite condition is TRUE, the loop exits.
addi $s1,$0,0 # i = 0
addi $t1,$0,1 # $t1 = 1, the constant to compare against
while:
addi $s1,$s1,-2 # i = i - 2
ble $s1,$t1,done # branch to done if i <= 1 (the opposite of i > 1)
j while # otherwise jump back to while and loop again
done:
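Both ways of writing the loop can be compared in C using if-goto-label. In the question's MIPS, bne t1,$zero,DO branches back to DO whenever t1 is non-zero, that is, whenever 1 < i, so a bottom-tested loop branches on the original condition. A loop that uses an exit branch plus an unconditional jump tests the opposite one. The function names below are hypothetical.

```c
/* Bottom-tested form, as in the question's MIPS: branch back while i > 1. */
int do_while_bottom(int i)
{
top:
    i = i - 2;
    if (i > 1) goto top;       /* slt t1,t0,s1 ; bne t1,$zero,DO */
    return i;
}

/* Exit-tested form: leave on the opposite condition, else jump back. */
int do_while_exit(int i)
{
top:
    i = i - 2;
    if (i <= 1) goto done;     /* opposite of (i > 1) */
    goto top;
done:
    return i;
}

/* Reference: the original C loop. */
int do_while_ref(int i)
{
    do {
        i = i - 2;
    } while (i > 1);
    return i;
}
```

All three return the same value for any starting i; the bottom-tested form simply needs one branch per iteration instead of a branch and a jump.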
This is a homework assignment, I've written the whole program myself, run through it in the debugger, and everything plays out the way I mean it to EXCEPT for this line:
sw $t1, counter($a3)
The assignment is to convert this snippet of C code to MIPS
for(i = 0; i < a; i++) {
for(j = 0; j < b; j++) {
C[2 * i] = i - j; } }
All the registers change values the way they should in my program except for $a3 - It never changes.
Changes: An array needed to be declared and "pointed to" by a register and a label can't be used for an offset in the manner I started with
EDIT: Here's the finished, working code
Recap answer from the comments
Your $a3 register, is supposed to be loaded with the address of an array defined in the .data section.
One big problem with your code is how you constructed your loops. The best way is to translate your loops step by step, and one loop at a time. Also, remember that :
for( i = 0; i < a; i++ )
{
loop_content;
}
Is equivalent to :
i = 0;
while( i < a )
{
loop_content;
i++;
}
Which is easier to translate into assembly. The condition just has to be negated, as you need an "exit" condition and not a "continue" condition as in a while loop. Your code will be much clearer and easier to understand (and less error prone).
Your "out of range" error comes from here : sw $t1, counter($a3). Here counter is a label, and therefore an address. Thus counter($a3) is doing "$a3 (=0x10010008) + address of counter (=0x100100f8)", giving 0x20020100, which is clearly not what you want (and nonsense).
Oh, and in the sw $r, offset($a) MIPS instruction, offset MUST be a 16-bit CONSTANT. Here, you use a 32-bit address, but the assembler kindly translates sw $t1, counter($a3) to $x = $a3 + counter; sw $t1, 0($x), which is why you may see a sw with 0 as offset.
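The address arithmetic can be reproduced in C with 32-bit unsigned integers. This is a sketch with hypothetical function names; the two constants in the usage note are the ones quoted above, and the 4-byte element size assumes an array of 32-bit ints.

```c
#include <stdint.h>

/* What sw $t1, counter($a3) computes: register value plus the label's address. */
uint32_t label_plus_register(uint32_t reg, uint32_t label_addr)
{
    return reg + label_addr;
}

/* What was intended: base address of the array plus a byte offset.
   index counts elements; each element is assumed to be 4 bytes. */
uint32_t element_address(uint32_t base, uint32_t index)
{
    return base + index * 4u;
}
```

label_plus_register(0x10010008, 0x100100f8) gives 0x20020100, the out-of-range address, while element_address(0x10010008, 3) gives 0x10010014, which stays inside the data segment.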