Expanding or avoiding addiu in MIPS

I've implemented a program (a fully-connected layer) in C, which needs to be compiled to MIPS to run on a specific microprocessor in order to test the microprocessor's function. Since the ADDIU instruction is not part of this processor's instruction set, I am editing the C program to produce fewer ADDIU instructions at compile time and trying to edit the remaining ones out of the MIPS code (ADD and ADDU are allowed). However, I am brand new to MIPS and want to make sure my edits don't change the function of the program. Is there an expansion for ADDIU using other instructions? If not, any ideas for how I can change the logic of my program to avoid using them?
I am developing a test for a specific microprocessor with a limited MIPS instruction set. Many of the problematic instructions in the compiled code can be expanded to use only instructions in the set, so I will edit the compiled code to include those expansions. However, ADDIU doesn't seem to have an expansion according to the expansion guides I've seen.
I've already gotten rid of some ADDIU instructions by storing commonly-used values as constants so I can refer to variables instead of literals in the rest of the C code, resulting in ADDU instructions (which are allowed). The ADDIU instructions which I'm having trouble editing out occur in the following places:
Manipulating or accessing the values of the stack and frame pointers. I've thought about hard-coding the addends as constants, but I'm not sure if that's even possible or if it would change the values in question.
e.g. addiu $sp,$sp,-3840
e.g. addiu $3,$fp,52
Accessing the high/low parts of 32-bit integers separately using %hi and %lo and adding them together
e.g. lui $2,%hi(output_layer.3511)
addiu $2,$2,%lo(output_layer.3511)
Note: output_layer is an array of 32-bit ints.
Addiu instructions that occur when I compile the "mod" function in C (expanding the mod function to get the remainder "the hard way" didn't help) e.g. fracPart = currentInput % 256; in C
compiles to
lw $3,40($fp)
li $2,-2147483648 # 0xffffffff80000000
ori $2,$2,0xff
and $2,$3,$2
bgez $2,$L17
nop
addiu $2,$2,-1
li $3,-256 # 0xffffffffffffff00
or $2,$2,$3
addiu $2,$2,1
$L17:
sw $2,48($fp)
The goal is working MIPS code which contains only instructions in the instruction set of this particular microprocessor, which does not include ADDIU.

addiu and addi are almost identical. The only difference is that addi raises an exception when the addition overflows as a signed operation, while addiu never traps; the result bits are the same either way.
So, you can replace every addiu by addi, as long as overflow cannot occur.
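For example, one of the question's instructions translates one-for-one:
addiu $3,$fp,52
# becomes
addi $3,$fp,52 # same result, but would trap if the addition overflowed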
Manipulating or accessing the values of the stack and frame pointers. I've thought about hard-coding the addends as constants, but I'm not sure if that's even possible or if it would change the values in question.
No problem replacing addiu by addi here. No sane software creates addresses in $sp/$fp that could overflow an addition like this.
Accessing the high/low parts of 32-bit integers separately using %hi and %lo and adding them together
You can use addi, but people generally use an ori for this operation.
lui $2,%hi(output_layer.3511)
ori $2,$2,%lo(output_layer.3511)
In either case, there is no risk of overflow in the addition (the 16 LSB are cleared by the lui). One caveat, though: with the GNU assembler, %hi is computed as (addr + 0x8000) >> 16 to compensate for addiu sign-extending %lo, so addiu and ori only give the same result when bit 15 of the address is clear; with bit 15 set, the ori version ends up 0x10000 too high.
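A worked example of that caveat, with a hypothetical address whose bit 15 is set (assuming the GNU assembler's %hi/%lo semantics):
# suppose the symbol lives at 0x00408123, so %lo = 0x8123 acts as a negative 16-bit value
# GNU as pre-adjusts the high part: %hi = (0x00408123 + 0x8000) >> 16 = 0x0041
lui $2,0x0041 # $2 = 0x00410000
addiu $2,$2,0x8123 # adds sign-extended 0xffff8123: $2 = 0x00408123, correct
# ori $2,$2,0x8123 here would instead give 0x00418123, off by 0x10000
For the question's output_layer array, whether plain ori is safe therefore depends on where the linker places the symbol.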
Addiu instructions that occur when I compile the "mod" function in C (expanding the mod function to get the remainder "the hard way" didn't help) e.g. fracPart = currentInput % 256; in C compiles to
lw $3,40($fp)
li $2,-2147483648 # 0xffffffff80000000
ori $2,$2,0xff
and $2,$3,$2
bgez $2,$L17
nop
addiu $2,$2,-1
li $3,-256 # 0xffffffffffffff00
or $2,$2,$3
addiu $2,$2,1
$L17:
sw $2,48($fp)
This code seems very strange. The two lines (li+ori) could be written as a single
li $2, 0x800000ff
although li of a full 32-bit constant is itself a pseudo-instruction that the assembler expands to a lui+ori pair, so nothing is really saved.
The part after the bgez executes only for strictly negative values, where it is almost equivalent to an or with 0xffffff00. The addiu pair is not quite useless, though: when the and leaves only the sign bit (i.e. the remainder is 0), the -1/+1 correction maps the result to 0 instead of -256.
Either addiu can be replaced by addi, with one caveat: when $2 holds 0x80000000 (INT_MIN), addi $2,$2,-1 overflows and traps, whereas addiu wraps silently. The register-based expansion below (using addu) avoids that problem.
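If the C source can still be changed, a simpler route is to compute the remainder in unsigned arithmetic, which needs no sign-fixup sequence at all. A minimal sketch (only equivalent if currentInput is known to be non-negative, since C's % truncates toward zero for signed operands):
unsigned int fracPart = (unsigned int)currentInput % 256u;
/* an unsigned remainder by a power of two is a pure bit mask;
   GCC typically emits a single andi $2,$3,0xff for it */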
EDIT:
If addi is not available, you can copy the immediate into a free register and then perform an add/addu with that register. In most MIPS conventions, $1 is reserved as a temporary for the assembler and is never used by compilers, so you can use it freely (provided you do not use macros that may clobber it).
A systematic translation of addiu for immediates in the range 0 to 0x7FFF can be
addiu $d, $s, imm
## ->
ori $1, $0, imm
addu $d, $s, $1
Both ori and addu are real instructions, and $1 can be used safely (in some assemblers you must write $at, the assembler temporary, instead of $1). Two details matter: ori zero-extends its 16-bit immediate while addiu sign-extends it, so this sequence is wrong for negative immediates; and addu is used rather than add so that, like addiu, the sequence never raises an overflow exception.
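For a negative immediate such as the question's addiu $sp,$sp,-3840, a sketch of one safe expansion (still assuming $1/$at is free) is to build the sign-extended 32-bit constant explicitly and then use addu:
lui $1,0xffff # upper half of 0xfffff100, i.e. -3840 sign-extended to 32 bits
ori $1,$1,0xf100 # $1 = 0xfffff100 = -3840
addu $sp,$sp,$1 # wraps like addiu instead of trapping like add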

I work in a lab which develops new microprocessors, so this micro is not commercially available.
From what you describe, the compiler is also under development. You should discuss this issue with the team developing the compiler, so they can take your needs into account.

Related

How to automatically increment offset in lw mips?

I need to iterate through the array below. However the values of the array need to be stored at 0x10010080, but the program needs to start at 0x10010000. Is there a way for me to create a loop that automatically increments the offset by 4, so that the next lw instruction is:
lw $t0, 132($s0)
.data 0x10010080
array: .word 0x10010008, 0x1001000C, 0x10010006, 0x1001000D, 0x10010002
.text
li $s0, 0x10010000
# store the value 0x10010008 in t0
lw $t0, 128($s0)
MIPS has no automatic increment.  Even processors that do (e.g. ARM) require you to request the increment, which supports the argument that it is not really automatic.  Since MIPS cannot request an increment as part of a load or store at all, simply increase the value of the pointer with a separate instruction.
As always, if you cannot do it in one instruction, then use a sequence of one or more.
As another point, asking for an instruction that reads 128($s0) to become an instruction that reads 132($s0) is called self-modifying code.  Older processors supported this concept for two reasons: (1) it was sometimes necessary because of missing instructions (instructions that took constants but not variables), and (2) older architectural designs didn't care about code being modified like data.  Newer processors provide most of the once-missing operations in instruction form, and their designs don't tolerate code modified on the fly: it hampers performance, or simply doesn't work without cache synchronization, which is itself a performance problem.
If you want to access a different location from the first, then use normal/regular array indexing or any other pointer arithmetic to refer to the (new) desired location.
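For instance, a sketch of the question's loop using a pointer bump in a separate instruction (the array base and element count come from the question; bne/addiu are standard MIPS):
li $s0, 0x10010080 # point directly at the array
li $t1, 5 # number of words in the array
loop:
lw $t0, 0($s0) # load the current element
# ... use $t0 here ...
addiu $s0, $s0, 4 # advance the pointer to the next word
addiu $t1, $t1, -1 # count down
bne $t1, $zero, loop # repeat for all 5 elements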

Could you use C inline assembly to align instructions? (without Compiler optimizations)

I have to do a university project where we have to use cache optimizations to improve the performance of a given code but we must not use compiler optimizations to achieve it.
One of the ideas I had reading the bibliography is to align the beginning of a basic block to a line cache size. But can you do something like:
asm(".align 64;")
for(int i = 0; i<N; i++)
... (whole basic block)
in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some trick like _mm_malloc to achieve data alignment but none for instructions. Could anyone please give me some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment; see footnote 1), but it does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop (the target of the backwards branch).
In general, https://gcc.gnu.org/wiki/DontUseInlineAsm and especially never use GNU C Basic asm("foo"); inside a function. But in debug mode (the -O0 default, aka optimizations disabled), each statement (including asm();) compiles to a separate block of asm in source order, so your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also, in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string.) At worst, with optimization enabled the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().
How this actually compiles
This does not exactly work; a C for loop compiles to some asm instructions before the asm loop. Especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.
But that jmp alone is only one small (2-byte) instruction, so aligning before that would put the top of the loop near the start of a possible instruction fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary, plus bloated code size.)
void foo(register int *p)
{
// always use .p2align n or .balign 1<<n so it's unambiguous across targets like MacOS vs. Linux, never .align
asm(" .p2align 5 # from inline asm");
for (register int *endp = p + 102400; p<endp ; p++) {
*p += 123;
}
}
Compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and to make it harder for debug mode to deoptimize it into more wasted asm instructions.
# GCC11.3 -O0 (the default with no options, except for -masm=intel added by Godbolt)
foo:
push rbp
mov rbp, rsp
push rbx # GCC stupidly picks a call-preserved reg it has to save
mov rax, rdi
.p2align 5 # from inline asm
lea rbx, [rax+409600] # endp = p+102400
jmp .L2 # jump to the p<endp condition before the first iteration
## The actual top of the loop. 9 bytes past the alignment boundary
.L3: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx # A memory destination add dword [rax], 123 would be 2 uops for the front-end (fused-domain) on Intel, vs. 3 for 3 separate instructions.
add rax, 4 # p++
.L2:
cmp rax, rbx
jb .L3 # }while(p<endp)
nop
nop # These aren't for alignment, IDK what this is for.
mov rbx, QWORD PTR [rbp-8] # restore RBX
leave # and restore RBP / tear down stack frame
ret
This loop is 5 uops long (assuming macro-fusion of cmp/JCC), so it can run at 1 cycle per iteration on Ice Lake or Zen, if all goes well. (Load/store of 1 dword per cycle is not much memory bandwidth, so that should keep up over a large array, maybe even if it doesn't fit in L3 cache.) Or on Haswell, for example, maybe 1.25 cycles per iteration, or a little worse due to loop-buffer effects.
If you use "binary" output mode on Godbolt, you can see that lea rbx, [rax+409600] is a 7-byte instruction, while jmp .L2 is 2 bytes, and that the address of the top of the loop is 0x401149, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so it's only wasted 2 uops out of the first uop cache line associated with this block, so we're still relatively good in term of 32-byte blocks.
(Godbolt "binary" mode compiles and links into an executable, and runs objdump -d on that. That also lets us see the .p2align directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)
A fairly obvious transformation gets the LEA before the .p2align. (See the asm in the Godbolt link for all of these source versions if you're curious).
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
for ( ; p < endp ; p++) {
*p += 123;
}
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop becomes the following, with only a 2-byte jmp to the loop condition. So this is pretty decent, and gets most of the benefit.
lea rbx, [rax+409600]
.p2align 5 # from inline asm
jmp .L5 # 2-byte instruction
.L6:
It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement.
To actually align the asm loop, we can write it as a do{}while.
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
do {
*p += 123;
p++;
}while(p < endp);
lea rbx, [rax+409600]
.p2align 5 # from inline asm
.L8: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
cmp rax, rbx
jb .L8 # while(p<endp)
No jmp at the start, no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)
Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length was a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop. Or more generally, put the loop inside an if like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
if(n!=0) {
register int *endp = p + n;
asm (".p2align 4");
do {
...
}while(p!=endp);
}
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).
.p2align 4,,10 # what GCC does on its own
.p2align 3
But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P
However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop. Godbolt.
# gcc -O0 -falign-labels=16
## from compiling endp=...; asm(); while() {}
lea rbx, [rax+409600] # endp = ...
.p2align 5 # from inline asm
jmp .L5
.p2align 4 # from GCC itself, pads another 14 bytes to an odd multiple of 16 (if you didn't remove the manual .p2align 5)
.L6:
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
.p2align 4 # from GCC itself: one 5-byte NOP in this particular case
.L5:
cmp rax, rbx
jb .L6
Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.
(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)
If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm, with input/output constraints. (But don't really do that; it's hard to get right and in real life a normal person would just enable optimizations as a first step.)
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe on some other ISAs; only x86 is so hard to decode that uop caches are a thing. (You didn't actually specify x86, but currently most people are using x86 CPUs in their desktops/laptops, so I'm assuming that.)
The main reason alignment of branch targets helps (especially tops of loops), is when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle).
But the worst case of mis-alignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (e.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.
(Or 3-wide in the early Intel P6-family days, before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop (a couple dozen uops), and Sandybridge-family added a uop cache to cache large loops, including multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams; see also https://www.agner.org/optimize/, especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (that can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?). These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.
Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes, this does something in debug builds only; otherwise it's meaningless, see footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.
Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably involving a smallish loop over some large amounts of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help
See
C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or benchmark that way for normal use-cases. But also mentions some things that are faster for that case (unlike normal builds) like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolete keyword still does something for unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not in C as yet.
You definitely want to use register int i to let a debug build keep it in a register, and write your C like it was hand-written asm. For example, using pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.
register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.)
Especially for variables that change inside loops, to avoid store/reload latency bottlenecks, e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register on a modern x86-64 CPU like Skylake.
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.
(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. Exactly because this is something compilers can't always optimize away. Although often they can thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
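As a sketch of that advice combined with register (the names arr, n and add123 are hypothetical):
#include <stdint.h>

void add123(int *arr, intptr_t n) {
    /* pointer-width index: the addressing mode needs no extra sign extension,
       and register keeps i in a register even at -O0 */
    for (register intptr_t i = 0; i < n; i++)
        arr[i] += 123;
}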
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.

Assembly optimizing function call, taking constants out of function allowed?

I have a recursive C function
void foo(int numFruits) {
....
// recurse at some point
}
inside the main function.
The corresponding assembly would look like this:
.pos 0x500
main:
%r10 // contains numFruits
call foo
halt
.pos 0x4000
foo: // recursive
irmovq $8, %r13 // load value 8 into %r13
...
Inside foo, I am using a constant value for the size of a quad, which is 8 bytes. (The value 8 is not present in the C code, but I use it to turn an array length into the corresponding address, etc.)
If I load this value every time foo recurses, I think it wastes cycles. I was wondering whether the compiler is able to optimize this so that constants are loaded before calling foo in main.
Example: loading the value 8 into r13 once before calling foo, so that it doesn't have to be reloaded every single time (provided that r13 is restored to its original state afterwards).
If I were to save value 8 into r13 before main, would this be still preserving the spirit of foo(int numFruits) or is my change equivalent to foo(int numFruits, int quadSize)?
Thank you
It's equivalent to foo(int numFruits, long quadSize). Well maybe int quadSize if your y86 ABI has 64-bit int. All of the normal x86-64 ABIs have 32-bit int, and Windows x64 even has 32-bit long.
You also tagged this x86. x86-64 can move 8 into a 64-bit register with a 5-byte instruction like mov $8, %r13d: 1 opcode byte + imm32 (actually 6 bytes including the REX prefix). You only need mov r64, imm64 for constants that don't fit in a zero- or sign-extended 32-bit immediate. Writing a 32-bit register zero-extends into the full 64-bit register. You can code-golf the constant setup even further, at the cost of speed, like push $imm8 / pop %r13 in 3 bytes (actually 4 with a REX prefix). When optimizing for code size, you want to avoid r8..r15. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/132985#132985.
I have no idea if y86 has efficient machine-code encodings for small constants.
There are no physical y86 CPUs in existence that I know of. There are emulators, but I'm not sure if there are even any virtual designs (like verilog) of y86 hardware that could be simulated in a cycle-accurate simulator.
So any talk of "saving cycles" is a stretch for y86. Real x86-64 CPUs are pipelined superscalar with out-of-order-execution, and often not bottlenecked on code-fetch. Especially in modern CPUs with a uop cache. Depending on the loop, extra mov-immediate instructions off the critical path might not slow things down any. https://agner.org/optimize/, and see performance links in the x86 tag wiki.
But yes, you should generally hoist constant-setup out of loops.
If your "loop" is recursion that you can't easily optimize into a normal loop without an expensive call / ret, you can certainly make a wrapper function for public usage, and have it fall through into a private function that effectively uses a custom calling convention (which assumes that %r13 = 8).
.globl foo
foo:
irmovq $8, %r13
# .p2align 4 # optional 16-byte alignment for the recursion entry point
# fall through
.Lprivate_foo:
# only reachable with r13=8
# blah blah using r13=8
call .Lprivate_foo
# blah blah still assuming r13=8
call .Lprivate_foo
# more stuff
ret # the final return
Nothing else can call private_foo; it's a local label (.Lxxx) only visible from this source. So the body of .Lprivate_foo can assume that R13 = 8.
If r13 is a call-preserved register in your y86 calling convention (like it is in x86-64 System V), then either pick a call-clobbered register like r11, or have the public wrapper function call private_foo so it can restore the caller's r13 before returning. Using a register that functions are normally allowed to clobber makes this near-zero extra overhead instead of introducing an extra level of call/ret.
But that only works if you don't call any other unknown functions from inside your recursive function, otherwise you have to assume they clobbered R11.
Optimizing recursion into a loop has a big advantage, and compilers do that whenever possible. (In a double-recursive function like a tree traversal, they will often turn the 2nd recursive call into a loop branch, but still actually recurse for the non-tail recursion.)
If you're just using 8 as a scale factor, I'm worried that you're using a multiply. It's much more efficient to use a shift by 3. Or (since you tagged this x86 as well as y86) maybe use a scaled-index addressing mode. But if it's for a pointer increment, then real x86 would use an add-immediate, like add $8, %rsi, using the add r/m64, imm8 encoding, which only needs 1 byte for the constant (sign-extended to 64-bit).
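A sketch of those three options in real x86 AT&T syntax (not y86, whose standard subset has no shifts or scaled addressing):
# shift instead of multiply:
shlq $3, %rsi # %rsi *= 8
# or fold the scaling into the addressing mode:
leaq (%rdi,%rsi,8), %rax # %rax = %rdi + %rsi*8, no separate shift needed
# pointer increment with a small immediate:
addq $8, %rsi # add r/m64, imm8: only one byte for the constant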
But the closest x86 equivalent would be SIMD vector constants or floating-point constants, because there are no immediate forms of those. In that case, yes, you do want to set up the constant in a register outside the loop.

When calling into C from MIPS do we use JR or JALR?

I'm writing some self-modifying code and I want to call a C function (call it foo) from MIPS.
I've loaded the memory address of foo, &foo, into $t1. Now I jr/jalr $t1.
Will C set $ra as my current PC(+8) (before the jump) allowing me to use JR or will I have to JALR instead (ie, take care of it because C won't)?
I can't test this right now as I'm debugging my macros (for the MIPS mnemonics). I must have messed some up because I keep getting illegal instruction errors.
You have to use JALR. It is the call instruction, not the C function, that writes the return address: jalr stores PC+8 (the address after the delay slot) into $ra and then jumps, whereas plain jr does not write $ra at all, so the C function's closing jr $ra would have nowhere valid to return to.
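A minimal sketch of the call sequence (classic MIPS with a branch-delay slot; foo is the C function from the question):
la $t1, foo # address of the C function
jalr $t1 # writes PC+8 (past the delay slot) into $ra, then jumps
nop # branch-delay slot
# execution resumes here when foo returns with jr $ra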

SPIM (MIPS Simulator) is unable to parse the following statement lui $2,%hi($LC0)

We are developing a C to MIPS cross compiler, and for that we need to compare our MIPS code with the code generated by other C to MIPS cross compilers like LLVM, CodeSourcery, etc. We are using SPIM to run our MIPS code.
Right now we are using the CodeSourcery cross compiler to generate the corresponding MIPS code for a given C program. It generates statements like lui $2,%hi($LC0) and lw $3,%lo($LC0)($2), where $LC0 is a label, but when loading that file in SPIM, it shows a parser error:
(parser) Unknown character on line 30 of file test1.s
lui $2,%hi($LC0)
What is the equivalent MIPS code for this kind of statement, so that SPIM can parse it?
The %hi(addr) and %lo(addr) syntax is used by the GNU assembler, and it works as shown in the code above. SPIM does not support this syntax because it is intended as a simple teaching tool (see this question).
If you are developing a real MIPS cross compilation tool, you should test it with a full MIPS simulator like the free OVP MIPS Simulator. OVP is tested to be compatible with real MIPS processors.
In the book "See MIPS Run 2nd Edition" by Morgan Kaufmann there are equivalents MIPS code for the instructions that use "hi" and "lo" registers for e.g.
lw $2, addr      =>   lui   $at, %hi(addr)
                      lw    $2, %lo(addr)($at)
sw $2, addr($3)  =>   lui   $at, %hi(addr)
                      addu  $at, $at, $3
                      sw    $2, %lo(addr)($at)
la $2, addr      =>   lui   $at, %hi(addr)
                      addiu $2, $at, %lo(addr)
la $2, addr($3)  =>   lui   $at, %hi(addr)
                      addiu $2, $at, %lo(addr)
                      addu  $2, $2, $3
Substituting the equivalent code, which contains no %hi/%lo operators, lets SPIM run the program successfully.
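For instance, a minimal SPIM-ready sketch (the data value is hypothetical, and the label is renamed from $LC0 to LC0 since SPIM may reject labels starting with $):
.data
LC0: .word 42
.text
la $2, LC0 # pseudo-instruction: SPIM expands it to lui/ori itself
lw $3, 0($2) # same effect as lui $2,%hi(LC0) / lw $3,%lo(LC0)($2)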
You can use the GCC flag -mno-explicit-relocs to avoid assembler relocation operators when dealing with symbolic addresses (i.e. %hi and %lo).
For example:
mips-elf-gcc -S kernel.c -ffreestanding -mips32
-mno-explicit-relocs -Wall
You can check the GCC MIPS options at https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/MIPS-Options.html
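The expected difference in the generated assembly looks roughly like this (a sketch; exact output depends on the GCC version and options):
# default (explicit relocs):
lui $2,%hi($LC0)
lw $3,%lo($LC0)($2)
# with -mno-explicit-relocs, GCC emits the assembler macro form instead:
lw $3,$LC0 # GAS itself expands this into a lui + lw pair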
That is not legal syntax for SPIM's MIPS assembly.
Memory accesses in MIPS are usually written as
offset(register_pointer)
For example:
lw $t0, 0($sp)
AFAIK the offset is an immediate 16-bit signed integer and cannot be anything else (not a register, not a label, etc.).
