Is it possible to realize subroutine without indirect addressing? - c

I am working on the Simple-Compiler project in Deitel's book C How to Program. Its main goal is to build a compiler for a high-level language called SIMPLE; the corresponding machine language is called SIMPLETRON.
I've completed some basic features for this compiler but am now stuck on an enhanced requirement -- to realize gosub and return (subroutine features) for the SIMPLE language.
The main obstacle is that SIMPLETRON doesn't support indirect addressing, so the usual strategy of keeping subroutine return addresses on a stack can't work. Given that, is it possible to somehow make subroutines work?
PS: I searched this issue and found a relevant question here. It seemed self-modifying code might be the answer, but I couldn't find a concrete solution, so I'm asking this question. Moreover, in my opinion the SIMPLETRON instruction set would have to be extended to make self-modifying code work here, right?
Background information for SIMPLETRON machine language:
It has a single register, the accumulator.
The supported machine instructions are listed below:
Input/output operations
#define READ 10: Read a word from the terminal into memory; the operand is the memory address.
#define WRITE 11: Write a word from memory to the terminal; the operand is the memory address.
Load/store operations
#define LOAD 20: Load a word from memory into the accumulator; the operand is the memory address.
#define STORE 21: Store a word from the accumulator into memory; the operand is the memory address.
Arithmetic operations
#define ADD 30: Add a word from memory to the word in the accumulator (leaving the result in the accumulator); the operand is the memory address.
#define SUBTRACT 31: Subtract a word ...
#define DIVIDE 32: Divide a word ...
#define MULTIPLY 33: Multiply a word ...
Transfer of control operations
#define BRANCH 40: Branch unconditionally; the operand is the code location.
#define BRANCHNEG 41: Branch if the accumulator is negative; the operand is the code location.
#define BRANCHZERO 42: Branch if the accumulator is zero; the operand is the code location.
#define HALT 43: End the program. No operand.

I'm not familiar with SIMPLE or SIMPLETRON, but in general I can think of at least 3 approaches.
Self-modifying code
Have a BRANCH 0 instruction at the end of each subroutine and, before each call, code that builds a BRANCH <return address> instruction word in the accumulator and STOREs it over that final instruction, effectively forming a BRANCH <dynamic> instruction.
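For concreteness, here is a sketch in C of the words a compiler could emit for one such call. It assumes Deitel's Simpletron encoding of a word as opcode*100 + operand with 100 words of memory; the function and slot names are made up for illustration:
enum { LOAD = 20, STORE = 21, BRANCH = 40 };

int memory[100];

/* emit a call at address `at` to subroutine `sub` whose final,
   patchable instruction lives at `sub_exit`; `k` is a free
   constant-pool slot */
void emit_call(int at, int sub, int sub_exit, int k) {
    memory[k]      = BRANCH * 100 + (at + 3); /* precomputed "BRANCH ret" */
    memory[at]     = LOAD   * 100 + k;        /* ACC = that word          */
    memory[at + 1] = STORE  * 100 + sub_exit; /* patch subroutine's exit  */
    memory[at + 2] = BRANCH * 100 + sub;      /* transfer to subroutine   */
    /* when the patched exit executes, control resumes at `at + 3` */
}
Assuming code and data share the same Simpletron memory (as in the book), no new machine instructions are needed; a STORE into code space is the only trick.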
Static list of potential callers
If SIMPLE doesn't have indirect calls (i.e. every gosub targets a statically known subroutine), then the compiler knows the list of possible callers of each subroutine. Then it could have each call pass a unique argument (e.g. in the accumulator), which the subroutine can test (pseudocode):
SUBROUTINE:
...
if (arg == 0)
branch CALLER_1;
if (arg == 1)
branch CALLER_2;
if (arg == 2)
branch CALLER_3;
Inlining
If SIMPLE doesn't allow recursive subroutines, there's no need to implement calls at the machine code level at all. Simply inline every subroutine into its caller completely.

Yes, you can do this, even reasonably, without self-modifying code.
You turn your return addresses into a giant case statement.
The secret is understanding that a "return address" is just a way to get back to the point of the call, and that memory is just a giant array of named locations.
Imagine I have a program with many logical call locations, with the instruction after each call labelled:
CALL S
$1: ...
...
CALL T
$2: ...
...
CALL U
$3: ...
We need to replace the CALLs with something our machine can implement.
Let's also assume temporarily that only one subroutine call is active at any moment.
Then all that matters is that after a subroutine completes, control returns to the point after the call.
You can cause this by writing the following SIMPLETRON code (I'm making up the syntax). By convention I assume I have a bunch of memory locations K0, K1, K2, ... that contain the constants 0, 1, 2, etc., for as many constants as I need.
K1: 1
K2: 2
K3: 3
...
LOAD K1
JMP S
$1: ...
...
LOAD K2
JMP T
$2: ...
...
LOAD K3
JMP U
$3:....
S: STORE RETURNID
...
JMP RETURN
T: STORE RETURNID
...
JMP RETURN
U: STORE RETURNID
...
JMP RETURN
RETURN: LOAD RETURNID
SUB K1
JE $1
LOAD RETURNID
SUB K2
JE $2
LOAD RETURNID
SUB K3
JE $3
JMP * ; bad return address, just hang
In essence, each call site records a constant (RETURNID) unique to that call site, and "RETURN" logic uses that unique ID to figure out the return point. If you have a lot of subroutines, the return logic code might be quite long, but hey, this is a toy machine and we aren't that interested in efficiency.
You could always turn the return logic into a binary decision tree; the code would still be long, but it would take only log2(callcount) compares to decide how to get back, which is actually not bad at all.
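In C terms, the tree-shaped dispatch looks something like this (a sketch only, with four call sites for a balanced example; the commented-out jumps stand for the SIMPLETRON branches back to the labelled return points):
void dispatch_return(int returnid) {
    /* two compares instead of a linear scan over all call sites */
    if (returnid <= 2) {
        if (returnid == 1) { /* JMP $1 */ }
        else               { /* JMP $2 */ }
    } else {
        if (returnid == 3) { /* JMP $3 */ }
        else               { /* JMP $4 */ }
    }
}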
Let's relax our assumption of only one subroutine active at any moment.
You can define a separate RETURNID for each subroutine but still use the same RETURN code. With this idea, any subroutine can call any other subroutine. Obviously these routines are non-reentrant, so a given routine can't appear more than once in any call chain.
We can use this same idea to implement a return stack. The trick is to recognize that a stack is merely a set of memory locations with an address decoder that picks out members of the stack. So, let's implement PUSH and POP instructions as subroutines. We change our calling convention to make the caller record the RETURNID, leaving the accumulator free to pass a value:
LOAD K1
STORE PUSHRETURNID
LOAD valuetopush
JMP PUSH
$1:
LOAD K2
STORE POPRETURNID
JMP POP
$2:...
TEMP: 0 ; scratch location
STACKINDEX: 0 ; incremented to 1 on first use
STACK1: 0 ; 1st stack location
...
STACKN: 0
PUSH: STORE TEMP ; save value to push
LOAD PUSHRETURNID ; do this here once instead of in every exit
STORE RETURNID
LOAD STACKINDEX ; add 1 to SP here, once, instead of in every exit
ADD K1
STORE STACKINDEX
SUB K1
JE STORETEMPSTACK1
LOAD STACKINDEX
SUB K2
JE STORETEMPSTACK2
...
LOAD STACKINDEX
SUB Kn
JE STORETEMPSTACKn
JMP * ; stack overflow
STORETEMPSTACK1:
LOAD TEMP
STORE STACK1
JMP RETURN
STORETEMPSTACK2:
LOAD TEMP
STORE STACK2
JMP RETURN
...
POP: LOAD STACKINDEX
SUB K1 ; decrement SP here once, rather than in every exit
STORE STACKINDEX
LOAD STACKINDEX
SUB K0
JE LOADSTACK1
LOAD STACKINDEX
SUB K1
JE LOADSTACK2
...
LOADSTACKn:
LOAD STACKn
JMP RETURNFROMPOP
LOADSTACK1:
LOAD STACK1
JMP RETURNFROMPOP
LOADSTACK2:
LOAD STACK2
JMP RETURNFROMPOP
RETURNFROMPOP: STORE TEMP
LOAD POPRETURNID
SUB K1
JE RETURNFROMPOP1
LOAD POPRETURNID
SUB K2
JE RETURNFROMPOP2
...
RETURNFROMPOP1: LOAD TEMP
JMP $1
RETURNFROMPOP2: LOAD TEMP
JMP $2
Note that we need RETURN to handle returns with no value, and RETURNFROMPOP to handle returns from the POP subroutine with a value.
These look pretty clumsy, but we can now realize a pushdown stack of fixed but arbitrarily large depth. If we again make binary decision trees out of the stack-location and RETURNID checks, the runtime cost is only logarithmic in the stack size/call count, which is actually pretty good.
OK, now we have general PUSH and POP subroutines. Now we can make calls that store the return address on the stack:
LOAD K1 ; indicate return point
STORE PUSHRETURNID
LOAD K2 ; call stack return point
JMP PUSH
$1: LOAD argument ; a value to pass to the subroutine
JMP RECURSIVESUBROUTINEX
; returns here with subroutine result in accumulator
$2:
RECURSIVESUBROUTINEX:
...compute on accumulator...
LOAD K3 ; indicate return point
STORE PUSHRETURNID
LOAD K4 ; call stack return point
JMP PUSH
$3: LOAD ... ; some revised argument
JMP RECURSIVESUBROUTINEX
$4: ; return here with accumulator containing result
STORE RECURSIVESUBROUTINERESULT
LOAD K5
STORE POPRETURNID
JMP POP
$5: ; accumulator contains return ID
STORE POPRETURNID
LOAD RECURSIVESUBROUTINERESULT
JMP RETURNFROMPOP
That's it. Now you have fully recursive subroutine calls with a stack, with no (well, faked) indirection.
I wouldn't want to program this machine manually because building the RETURN routines would be a royal headache to code and keep right. But a compiler would be perfectly happy to manufacture all this stuff.

Although there's no way to get the current instruction's location from within the SIMPLETRON instruction set, the assembler can keep track of instruction locations in order to generate the equivalent of return instructions.
The assembler would place a branch-to-address instruction in the program image to serve as the return instruction. To implement a call, it would then generate code that loads the appropriate "return instruction", stores it at the end of the subroutine, and branches to that subroutine. Each instance of a "call" requires its own instance of a "return instruction" in the program image. You may want to reserve a range of variable memory to store these "return instructions".
Example "code" using a call that includes the label of the return instruction as a parameter:
call sub1, sub1r
; ...
sub1: ; ...
sub1r: b 0 ;this is the return instruction
Another option would be something similar to MASM's PROC and ENDP, where the ENDP would hold the return instruction. The call directive would assume that the endp directive holds the branch to be modified, and the label would be the same as that of the corresponding proc directive.
call sub1
; ...
sub1 proc ;subroutine entry point
; ...
sub1 endp ;subroutine end point, "return" stored here
The issue here is that the accumulator would be destroyed by the "call" (but not affected by the "return"). If needed, subroutine parameters could be stored as variables, perhaps using assembler directives for labeling:
sub1 parm1 ;parameter 1 for sub1
;....
load sub1.parm1 ;load sub1 parameter 1

Related

Unwinding frame but do not return in C

My programming language compiles to C, I want to implement tail recursion optimization. The question here is how to pass control to another function without "returning" from the current function.
It is quite easy if the control is passed to the same function:
void f() {
__begin:
    /* do something here... */
    goto __begin; // "call" itself
}
As you can see, there is no return value and no parameters; those are passed on a separate stack addressed by a global variable.
Another option is to use inline assembly:
#ifdef __clang__
#define tail_call(func_name) asm("jmp " #func_name " + 8");
#else
#define tail_call(func_name) asm("jmp " #func_name " + 4");
#endif
void f() {
__begin:
    /* do something here... */
    tail_call(f); // "call" itself
}
This is similar to goto, but where goto passes control to the first statement in a function, skipping the "entry code" generated by the compiler, jmp is different: its argument is a function pointer, and you need to add 4 or 8 bytes to skip the entry code.
Both of the above will work, but only if the callee and the caller use the same amount of stack for local variables, since that space is allocated by the entry code of the callee.
I was thinking of executing leave manually with inline assembly, then replacing the return address on the stack, then doing a legal function call like f(). But my attempts all crashed; you need to modify BP and SP somehow.
So again, how can I implement this for x64? (Again, assuming functions have no arguments and return void.) A portable way without inline assembly would be better, but assembly is accepted. Maybe longjmp can be used?
Maybe you can even push the callee address on the stack, replacing the original return address and just ret?
Do not try to do this yourself. A good C compiler can perform tail-call elimination in many cases and will do so. In contrast, a hack using inline assembly has a good chance of going wrong in a way that is difficult to debug.
For example, see this snippet on godbolt.org. To duplicate it here:
The C code I used was:
#include <stdio.h>

int foo(int n, int o)
{
    if (n == 0) return o;
    puts("***\n");
    return foo(n - 1, o + 1);
}
This compiles to:
.LC0:
.string "***\n"
foo:
test edi, edi
je .L4
push r12
mov r12d, edi
push rbp
mov ebp, esi
push rbx
mov ebx, edi
.L3:
mov edi, OFFSET FLAT:.LC0
call puts
sub ebx, 1
jne .L3
lea eax, [r12+rbp]
pop rbx
pop rbp
pop r12
ret
.L4:
mov eax, esi
ret
Notice that the tail call has been eliminated. The only call is to puts.
Since you don't need arguments and return values, how about combining all functions into one and using labels instead of function names?
f:
__begin:
...
CALL(h); // a macro implementing traditional call
...
if (condition_ret)
RETURN; // a macro implementing traditional return
...
goto g; // tail recurse to g
The tricky part here is the RETURN and CALL macros. To return, you keep yet another stack, a stack of setjmp buffers: RETURN does a longjmp to ret_stack.pop(), and CALL pushes a setjmp buffer onto ret_stack before jumping. This is a poetic rendition, of course; you'll need to fill out the details.
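Filling out a few of those details, a minimal sketch (assuming everything lives in one big function so goto works, and ignoring overflow checks and volatile issues):
#include <setjmp.h>

static jmp_buf ret_stack[64];   /* fixed-depth return stack */
static int ret_sp = 0;

/* CALL jumps to a label; RETURN resumes just after the matching CALL */
#define CALL(label) do {                        \
        if (setjmp(ret_stack[ret_sp++]) == 0)   \
            goto label;                         \
    } while (0)

#define RETURN() longjmp(ret_stack[--ret_sp], 1)
The longjmp is legal here because the function invocation containing the setjmp is still active; that's exactly why all the code has to be combined into one function.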
gcc offers some optimization here with computed gotos, which are more lightweight than longjmp. Also, people who write VMs have similar problems, and seemingly have asm-based solutions for them even on MSVC; see the example here.
And finally, such an approach, even if it saves memory, may confuse the compiler and cause performance anomalies. You're probably better off generating code for some portable assembler-like language, LLVM maybe? Not sure; it should be something that has computed goto.
The venerable approach to this problem is to use trampolines. Essentially, every compiled function returns a function pointer (and maybe an arg count). The top level is a tight loop that, starting with your main, simply calls the returned function pointer ad infinitum. You could use a function that longjmps to escape the loop, i.e., to terminate the program.
See this SO Q&A. Or Google "recursion tco trampoline."
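A minimal trampoline sketch in C (names are illustrative; real compiled code would also thread arguments through a separate stack, as in the question):
#include <stddef.h>

/* each step returns the next step to run instead of calling it,
   so the machine stack never grows */
struct cont { struct cont (*next)(int *acc); };

static struct cont done(int *acc) {
    (void)acc;
    struct cont c = { NULL };            /* no next step: stop */
    return c;
}

static struct cont count_down(int *acc) {
    struct cont c = { (*acc)-- > 0 ? count_down : done };
    return c;                            /* "tail call" by returning */
}

void trampoline(struct cont (*start)(int *acc), int *acc) {
    struct cont c = { start };
    while (c.next)
        c = c.next(acc);                 /* constant stack depth */
    /* e.g. int n = 10; trampoline(count_down, &n); */
}
Each "call" costs a real call and return through the loop, so it's slower than true tail-call elimination, but it is portable C.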
For another approach, see Cheney on the MTA, where the stack just grows until it's full, which triggers a GC. This works once the program is converted to continuation-passing style (CPS): in that style functions never return, so after the GC the stack is all garbage and can be reused.
I will suggest a hack. The x86 call instruction, which is used by the compiler to translate your function calls, pushes the return address on the stack and then performs a jump.
What you can do is a bit of stack manipulation, using some inline assembly and possibly some macros to save yourself a bit of headache. You basically have to overwrite the return address on the stack, which you can do immediately in the called function. You can have a wrapper function that overwrites the return address and calls your function; control flow then returns to the wrapper, which then moves to wherever you pointed it.

Is there any operation in C analogous to this assembly code?

Today, I played around with incrementing function pointers in assembly code to create alternate entry points to a function:
.386
.MODEL FLAT, C
.DATA
INCLUDELIB MSVCRT
EXTRN puts:PROC
HLO DB "Hello!", 0
WLD DB "World!", 0
.CODE
dentry PROC
push offset HLO
call puts
add esp, 4
push offset WLD
call puts
add esp, 4
ret
dentry ENDP
main PROC
lea edx, offset dentry
call edx
lea edx, offset dentry
add edx, 13
call edx
ret
main ENDP
END
(I know, technically this code is invalid since it calls puts without the CRT being initialized, but it works without any assembly or runtime errors, at least on MSVC 2010 SP1.)
Note that in the second call to dentry I took the address of the function in the edx register, as before, but this time I incremented it by 13 bytes before calling the function.
The output of this program is therefore:
C:\Temp>dblentry
Hello!
World!
World!
C:\Temp>
The first output of "Hello!\nWorld!" is from the call to the very beginning of the function, whereas the second output is from the call starting at the "push offset WLD" instruction.
I'm wondering if this kind of thing exists in languages that are meant to be a step up from assembler, like C, Pascal or FORTRAN. I know C doesn't let you increment function pointers, but is there some other way to achieve this kind of thing?
AFAIK you can only write functions with multiple entry-points in asm.
You can put labels on all the entry points, so you can use normal direct calls instead of hard-coding the offsets from the first function-name.
This makes it easy to call them normally from C or any other language.
The earlier entry points work like functions that fall-through into the body of another function, if you're worried about confusing tools (or humans) that don't allow function bodies to overlap.
You might do this if the early entry-points do a tiny bit of extra stuff, and then fall through into the main function. It's mainly going to be a code-size saving technique (which might improve I-cache / uop-cache hit rate).
Compilers tend to duplicate code between functions instead of sharing large chunks of common implementation between slightly different functions.
However, you can probably accomplish it with only one extra jmp with something like:
int bigfunc(int x);                        /* declare before use */
int foo(int a) { return bigfunc(a + 1); }
int bar(int a) { return bigfunc(a + 2); }
int bigfunc(int x) { /* a lot of code */ return x; }
See a real example on the Godbolt compiler explorer
foo and bar tailcall bigfunc, which is slightly worse than having bar fall-through into bigfunc. (Having foo jump over bar into bigfunc is still good, esp. if bar isn't that trivial.)
Jumping into the middle of a function isn't in general safe, because non-trivial functions usually need to save/restore some regs. The prologue pushes them, and the epilogue pops them. If you jump into the middle, the pops in the epilogue will unbalance the stack (i.e. pop what should have been the return address into a register, and return to a garbage address).
See also Does a function with instructions before the entry-point label cause problems for anything (linking)?
You can use the longjmp function: http://www.cplusplus.com/reference/csetjmp/longjmp/
It's a fairly horrible function, but it'll do what you seek.

Lookup table vs switch in C embedded software

In another thread, I was told that a switch may be better than a lookup table in terms of speed and compactness.
So I'd like to understand the differences between this:
Lookup table
static void func1(){}
static void func2(){}
typedef enum
{
FUNC1,
FUNC2,
FUNC_COUNT
} state_e;
typedef void (*func_t)(void);
const func_t lookUpTable[FUNC_COUNT] =
{
[FUNC1] = &func1,
[FUNC2] = &func2
};
void fsm(state_e state)
{
if (state < FUNC_COUNT)
lookUpTable[state]();
else
;// Error handling
}
and this:
Switch
static void func1(){}
static void func2(){}
void fsm(int state)
{
switch(state)
{
case FUNC1: func1(); break;
case FUNC2: func2(); break;
default: ;// Error handling
}
}
I thought that a lookup table was faster since compilers try to transform switch statements into jump tables when possible.
Since this may be wrong, I'd like to know why!
Thanks for your help!
As I was the original author of the comment, I have to add a very important issue you did not mention in your question: the original question was about an embedded system. Presuming this is a typical bare-metal system with integrated Flash, there are very important differences from a PC; I will concentrate on these.
Such embedded systems typically have the following constraints.
no CPU cache.
Flash requires wait states at higher CPU clocks (i.e. above ca. 32 MHz). The actual ratio depends on the die design, low-power/high-speed process, operating voltage, etc.
To hide waitstates, Flash has wider read-lines than the CPU-bus.
This only works well for linear code with instruction prefetch.
Data accesses disturb instruction prefetch, or are stalled until the prefetch has finished.
Flash might have an internal very small instruction cache.
If any at all, there is an even smaller data-cache.
The small caches result in more frequent thrashing (replacing a previous entry before it has been used another time).
For example, on the STM32F4xx a read takes 6 clocks at 150 MHz/3.3 V for 128 bits (4 words). So if a data access is required, chances are good that it adds more than 12 clocks of delay until all data has been fetched (and there are additional cycles involved).
Presuming compact state-codes, for the actual problem, this has the following effects on this architecture (Cortex-M4):
Lookup-table: Reading the function address is a data-access. With all implications mentioned above.
A switch, otoh, uses a special "table-lookup" instruction that takes its data from code space right behind the instruction. So the first entries are possibly already prefetched, and the other entries don't break the prefetch. Also, the access is a code access, so the data goes into the Flash's instruction cache.
Also note that the switch does not need functions, so the compiler can fully optimise the code. This is not possible for a lookup table; at the very least, the switch avoids the code for function entry/exit.
Due to the aforementioned and other factors, an exact estimate is hard to give. It heavily depends on your platform and the code structure. But assuming the system given above, the switch is very likely faster (and clearer, btw.).
First, on some processors, indirect calls (e.g. through a pointer) - like those in your Lookup Table example - are costly (pipeline breakage, TLB, cache effects). It might also be true for indirect jumps...
Then, a good optimizing compiler might inline the call to func1() in your Switch example; then you won't run any prologue or epilogue for the inlined function.
You need to benchmark to be sure, since a lot of other factors matter on the performance. See also this (and the reference there).
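If you benchmark on a host first, even a crude harness like this sketch will show differences; on the actual embedded target you'd read a hardware cycle counter (e.g. DWT CYCCNT on a Cortex-M) instead of clock(). fsm() here is either version from the question (adjust the parameter type to match):
#include <stdio.h>
#include <time.h>

void fsm(int state);

int main(void) {
    clock_t t0 = clock();
    for (long i = 0; i < 10000000L; i++)
        fsm((int)(i & 1));             /* alternate two states */
    printf("%ld clock ticks\n", (long)(clock() - t0));
    return 0;
}
Note that alternating between two states, as here, keeps the branches highly predictable; the distribution of states in your real workload matters a great deal.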
Using a LUT of function pointers forces the compiler to use that strategy. It could in theory compile the switch version to essentially the same code as the LUT version (now that you've added out-of-bounds checks to both). In practice, that's not what gcc or clang choose to do, so it's worth looking at the asm output to see what happened.
(update: gcc -fpie (on by default on most modern Linux distros) likes to make tables of relative offsets, instead of absolute function pointers, so the rodata is position-independent, too. GCC Jump Table initialization code generating movsxd and add?. This could be a missed-optimization, see my answer there for links to gcc bug reports. Manually creating an array of function pointers could work around that.)
I put the code on the Godbolt compiler explorer with both functions in one compilation unit (with gcc and clang output), to see how it actually compiled. I expanded the functions a bit so it wasn't just two cases.
void fsm_switch(int state) {
switch(state) {
case FUNC0: func0(); break;
case FUNC1: func1(); break;
case FUNC2: func2(); break;
case FUNC3: func3(); break;
default: ;// Error handling
}
//prevent_tailcall();
}
void fsm_lut(state_e state) {
if (likely(state < FUNC_COUNT)) // without likely(), gcc puts the LUT on the taken side of this branch
lookUpTable[state]();
else
;// Error handling
//prevent_tailcall();
}
See also
How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?
x86
On x86, clang makes its own LUT for the switch, but the entries are pointers to within the function, not the final function pointers. So for clang-3.7, the switch happens to compile to code that is strictly worse than the manually-implemented LUT. Either way, x86 CPUs tend to have branch prediction that can handle indirect calls / jumps, at least if they're easy to predict.
GCC uses a sequence of conditional branches (but unfortunately doesn't tail-call directly with conditional branches, which AFAICT is safe on x86). It checks 1, <1, 2, 3, in that order, with mostly not-taken branches until it finds a match.
They make essentially identical code for the LUT: bounds check, zero the upper 32-bit of the arg register with a mov, and then a memory-indirect jump with an indexed addressing mode.
ARM:
gcc 4.8.2 with -mcpu=cortex-m4 -O2 makes interesting code.
As Olaf said, it makes an inline table of 1-byte entries. It doesn't jump directly to the target function, but instead to a normal jump instruction (like b func3). This is a normal unconditional jump, since it's a tail-call.
Each table destination entry needs significantly more code (Godbolt) if fsm_switch does anything after the call (like in this case a non-inline function call, if void prevent_tailcall(void); is declared but not defined), or if this is inlined into a larger function.
## With void prevent_tailcall(void){} defined so it can inline:
## Unlike in the godbolt link, this is doing tailcalls.
fsm_switch:
cmp r0, #3 # state,
bhi .L5 #
tbb [pc, r0] # state
## There's no section .rodata directive here: the table is in-line with the code, so there's no need for base pointer to be loaded into a reg. And apparently it's even loaded from I-cache, not D-cache
.byte (.L7-.L8)/2
.byte (.L9-.L8)/2
.byte (.L10-.L8)/2
.byte (.L11-.L8)/2
.L11:
b func3 # optimized tail-call
.L10:
b func2
.L9:
b func1
.L7:
b func0
.L5:
bx lr # This is ARM's equivalent of an x86 ret insn
IDK if there's much difference between how well branch prediction works for tbb vs. a full-on indirect jump or call (blx), on a lightweight ARM core. A data access to load the table might be more significant than the two-step jump to a branch instruction you get with a switch.
I've read that indirect branches are poorly predicted on ARM. I'd hope it's not bad if the indirect branch has the same target every time. But if not, I'd assume most ARM cores won't find even short patterns the way big x86 cores will.
Instruction fetch/decode takes longer on x86, so it's more important to avoid bubbles in the instruction stream. This is one reason why x86 CPUs have such good branch prediction. Modern branch predictors even do a good job with patterns for indirect branches, based on history of that branch and/or other branches leading up to it.
The LUT function has to spend a couple instructions loading the base address of the LUT into a register, but otherwise is pretty much like x86:
fsm_lut:
cmp r0, #3 # state,
bhi .L13 #,
movw r3, #:lower16:.LANCHOR0 # tmp112,
movt r3, #:upper16:.LANCHOR0 # tmp112,
ldr r3, [r3, r0, lsl #2] # tmp113, lookUpTable
bx r3 # indirect register sibling call # tmp113
.L13:
bx lr #
# in the .rodata section
lookUpTable:
.word func0
.word func1
.word func2
.word func3
See Mike of SST's answer for a similar analysis on a Microchip dsPIC.
msc's answer and the comments give you good hints as to why performance may not be what you expect. Benchmarking is the rule, but results will vary from one architecture to another, and may change with other versions of the compiler and of course its configuration and options selected.
Note however that your 2 pieces of code do not perform the same validation on state:
The switch will gracefully do nothing if state is not one of the defined values,
The jump table version will invoke undefined behavior for all but the 2 values FUNC1 and FUNC2.
There is no generic way to initialize the jump table with dummy function pointers without making assumptions on FUNC_COUNT. To get the same behavior, the jump table version should look like this:
void fsm(int state) {
if (state >= 0 && state < FUNC_COUNT && lookUpTable[state] != NULL)
lookUpTable[state]();
}
Try benchmarking this and inspect the assembly code. Here is a handy online compiler for this: http://gcc.godbolt.org/#
On the Microchip dsPIC family of devices a look-up table is stored as a set of instruction addresses in the Flash itself. Performing the look-up involves reading the address from the Flash then calling the routine. Making the call adds another handful of cycles to push the instruction pointer and other bits and bobs (e.g. setting the stack frame) of housekeeping.
For example, on the dsPIC33E512MU810, using XC16 (v1.24) the look-up code:
lookUpTable[state]();
Compiles to (from the disassembly window in MPLAB-X):
! lookUpTable[state]();
0x2D20: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D22: ADD W4, W4, W5 ; 1 cycle (addresses are 16 bit aligned)
0x2D24: MOV #0xA238, W4 ; 1 cycle (get base address of look-up table)
0x2D26: ADD W5, W4, W4 ; 1 cycle (get address of entry in table)
0x2D28: MOV [W4], W4 ; 1 cycle (get address of the function)
0x2D2A: CALL W4 ; 2 cycles (push PC+2 set PC=W4)
... and each (empty, do-nothing) function compiles to:
!static void func1()
!{}
0x2D0A: LNK #0x0 ; 1 cycle (set up stack frame)
! Function body goes here
0x2D0C: ULNK ; 1 cycle (un-link frame pointer)
0x2D0E: RETURN ; 3 cycles
This is a total of 11 instruction cycles of overhead for any of the cases, and they all take the same. (Note: If either the table or the functions it contains are not in the same 32K program word Flash page, there will be an even greater overhead due to having to get the Address Generation Unit to read from the correct page, or to set up the PC to make a long call.)
On the other hand, providing that the whole switch statement fits within a certain size, the compiler will generate code that does a test and relative branch as two instructions per case taking three (or possibly four) cycles per case up to the one that's true.
For example, the switch statement:
switch(state)
{
case FUNC1: state++; break;
case FUNC2: state--; break;
default: break;
}
Compiles to:
! switch(state)
0x2D2C: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D2E: SUB W4, #0x0, [W15] ; 1 cycle (compare with first case)
0x2D30: BRA Z, 0x2D38 ; 1 cycle (if branch not taken, or 2 if it is)
0x2D32: SUB W4, #0x1, [W15] ; 1 cycle (compare with second case)
0x2D34: BRA Z, 0x2D3C ; 1 cycle (if branch not taken, or 2 if it is)
! {
! case FUNC1: state++; break;
0x2D38: INC [W14], [W14] ; To stop the switch being optimised out
0x2D3A: BRA 0x2D40 ; 2 cycles (go to end of switch)
! case FUNC2: state--; break;
0x2D3C: DEC [W14], [W14] ; To stop the switch being optimised out
0x2D3E: NOP ; compiler did a fall-through (for some reason)
! default: break;
0x2D36: BRA 0x2D40 ; 2 cycles (go to end of switch)
! }
This is an overhead of 5 cycles if the first case is taken, 7 if the second case is taken, etc., meaning they break even on the fourth case.
This means that knowing your data at design time will have a significant influence on long-term speed. If you have a significant number of cases (more than about 4) and they all occur with similar frequency, then a look-up table will be quicker in the long run. If the frequencies differ significantly (e.g. case 1 is more likely than case 2, which is more likely than case 3, etc.) and you order the switch with the most likely case first, then the switch will be faster in the long run. For the edge case where you only have a few cases, the switch will (probably) be faster anyway for most executions, and it is more readable and less error prone.
If there are only a few cases in the switch, or some cases will occur more often than others, then doing the test and branch of the switch will probably take fewer cycles than using a look-up table. On the other hand, if you have more than a handful of cases of that occur with similar frequency then the look-up will probably end up being faster on average.
Tip: Go with the switch unless you know the look-up will definitely be faster and the time it takes to run is important.
Edit: My switch example is a little unfair, as I've ignored the original question and in-lined the 'body' of the cases to highlight the real advantage of using a switch over a look-up. If the switch has to do the call as well then it only has the advantage for the first case!
To have even more compiler outputs, here is what the TI C28x compiler produces for @PeterCordes's sample code:
_fsm_switch:
CMPB AL,#0 ; [CPU_] |62|
BF $C$L3,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#1 ; [CPU_] |62|
BF $C$L2,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#2 ; [CPU_] |62|
BF $C$L1,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#3 ; [CPU_] |62|
BF $C$L4,NEQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
LCR #_func3 ; [CPU_] |66|
; call occurs [#_func3] ; [] |66|
B $C$L4,UNC ; [CPU_] |66|
; branch occurs ; [] |66|
$C$L1:
LCR #_func2 ; [CPU_] |65|
; call occurs [#_func2] ; [] |65|
B $C$L4,UNC ; [CPU_] |65|
; branch occurs ; [] |65|
$C$L2:
LCR #_func1 ; [CPU_] |64|
; call occurs [#_func1] ; [] |64|
B $C$L4,UNC ; [CPU_] |64|
; branch occurs ; [] |64|
$C$L3:
LCR #_func0 ; [CPU_] |63|
; call occurs [#_func0] ; [] |63|
$C$L4:
LCR #_prevent_tailcall ; [CPU_] |69|
; call occurs [#_prevent_tailcall] ; [] |69|
LRETR ; [CPU_]
; return occurs ; []
_fsm_lut:
;* AL assigned to _state
CMPB AL,#4 ; [CPU_] |84|
BF $C$L5,HIS ; [CPU_] |84|
; branchcc occurs ; [] |84|
CLRC SXM ; [CPU_]
MOVL XAR4,#_lookUpTable ; [CPU_U] |85|
MOV ACC,AL << 1 ; [CPU_] |85|
ADDL XAR4,ACC ; [CPU_] |85|
MOVL XAR7,*+XAR4[0] ; [CPU_] |85|
LCR *XAR7 ; [CPU_] |85|
; call occurs [XAR7] ; [] |85|
$C$L5:
LCR #_prevent_tailcall ; [CPU_] |88|
; call occurs [#_prevent_tailcall] ; [] |88|
LRETR ; [CPU_]
; return occurs ; []
I also used -O2 optimizations.
We can see that the switch is not converted into a jump table, even though the compiler has the ability to do so.

Assembly: Creating an array of linked nodes

Firstly, if this question is inappropriate because I am not providing any code or not doing any thinking on my own, I apologize, and I will delete this question.
For an assignment, we are required to create an array of nodes to simulate a linked list. Each node has an integer value and a pointer to the next node in the list. Here is my .DATA section
.DATA
linked_list DWORD 5 DUP (?) ;We are allowed to assume the linked list will have 5 items
linked_node STRUCT
value BYTE ?
next BYTE ?
linked_node ENDS
I am unsure if I am defining my STRUCT correctly, as I don't know what the type of next should be.
Also, I am confused as to how to approach this problem. To insert a node into linked_list I should be able to write mov [esi+TYPE linked_list*ecx], correct? Of course, I'd need to inc ecx every time.
What I'm confused about is how to do mov linked_node.next, "pointer to next node". Is there some sort of operator that would let me set linked_node.next equal to a pointer to the next index in the array? Or am I thinking about this incorrectly? Any help would be appreciated!
Think about your design in terms of a language you are familiar with. Preferably C, because pointers and values in C are concepts that map directly to asm.
Let's say you want to keep track of your linked list by storing a pointer to the head element.
#include <stdint.h> // for int8_t
struct node {
int8_t next; // array index. More commonly, you'd use struct node *next;
// negative values for .next are a sentinel, like a NULL pointer, marking the end of the list
int8_t val;
};
struct node storage[5]; // .next field indexes into this array
uint8_t free_position = 0; // when you need a new node, take index = free_position++;
int8_t head = -1; // start with an empty list
There are tricks to reduce corner cases, like having the list head be a full node, rather than just a reference (pointer or index). You can treat it as a first element, instead of having to check for the empty-list case everywhere.
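In C, that trick looks something like this (a sketch with pointer-based nodes rather than the int8_t indices used elsewhere in this answer; names are illustrative):
#include <stddef.h>

struct list_node { struct list_node *next; int val; };

/* with a dummy head that is always present, inserting into an empty
   list and inserting in the middle are the same code path */
void insert_sorted(struct list_node *dummy, struct list_node *n) {
    struct list_node *p = dummy;          /* never NULL */
    while (p->next != NULL && p->next->val < n->val)
        p = p->next;
    n->next = p->next;
    p->next = n;
}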
Anyway, given a node reference int8_t p (where p is the standard variable name for a pointer to a list node, in linked list code), the next node is storage[p.next]. The next node's val is storage[p.next].val.
Let's see what this looks like in asm. The NASM manual talks about how its macro system can help you make code using global structs more readable, but I haven't done any macro stuff for this. You might define macros for NEXT and VAL, with values 0 and 1, so you can say [storage + rdx*2 + NEXT]. Or even a macro that takes an argument, so you could say [NEXT(rdx*2)]. If you're not careful, though, you could end up with code that's more confusing to read.
section .bss
storage: resw 5 ;; reserve 5 words of zero-initialized space
free_position: db 0 ;; uint8_t free_position = 0;
section .data
head: db -1 ;; int8_t head = -1;
section .text
; p is stored in rdx. It's an integer index into storage
; We'll access storage directly, without loading it into a register.
; (normally you'd have it in a reg, since it would be space you got from malloc/realloc)
; lea rsi, [rel storage] ;; If you want RIP-relative addressing.
;; There is no [RIP+offset + scale*index] addressing mode, because global arrays are for tiny / toy programs.
test edx, edx
js .err_empty_list ;; check for p=empty list (sign-bit means negative)
movsx eax, byte [storage + 2*rdx] ;; load p.next into eax, with sign-extension
test eax, eax
js .err_empty_list ;; check that there is a next element
movsx eax, byte [storage + 2*rax + 1] ;; load storage[p.next].val, sign extended into eax
;; The final +1 in the effective address is because the val byte is 2nd.
;; you could have used a 3rd register if you wanted to keep p.next around for future use
ret ;; or not, if this is just the middle of some larger function
.err_empty_list: ; .symbol is a local symbol, doesn't have to be unique for the whole file
ud2 ; TODO: report an error instead of running an invalid insns
Notice that we get away with shorter instruction encoding by sign-extending into a 32bit reg, not to the full 64bit rax. If the value is negative, we aren't going to use rax as part of an address. We're just using movsx as a way to zero-out the rest of the register, because mov al, [storage + 2*rdx] would leave the upper 56 bits of rax with the old contents.
Another way to do this would be to movzx eax, byte [...] / test al, al, because the 8-bit test is just as fast to encode and execute as a 32bit test instruction. Also, movzx as a load has one cycle lower latency than movsx, on AMD Bulldozer-family CPUs (although they both still take an integer execution unit, unlike Intel where movsx/zx is handled entirely by a load port).
Either way, movsx or movzx is a good way to load 8-bit data, because you avoid problems with reading the full reg after writing a partial reg, and/or a false-dependency (on the previous contents of the upper bits of the reg, even if you know you already zeroed it, the CPU hardware still has to track it). Except if you know you're not optimizing for Intel pre-Haswell, then you don't have to worry about partial-register writes. Haswell does dual-bookkeeping or something to avoid extra uops to merge the partial value with the old full value when reading. AMD CPUs, P4, and Silvermont don't track partial-regs separately from the full-reg, so all you have to worry about is the false dependency.
Also note that you can load the next and val packed together, like
.search_loop:
movzx eax, word [storage + rdx*2] ; next in al, val in ah
test ah, ah
jz .found_a_zero_val
movzx edx, al ; use .next for the next iteration
test al, al
jns .search_loop
;; if we get here, we didn't find a zero val
ret
.found_a_zero_val:
;; do something with the element referred to by `rdx`
Notice how we have to use movzx anyway, because all the registers in an effective address have to be the same size. (So word [storage + al*2] doesn't work.)
This is probably more useful going the other way, to store both fields of a node with a single store, like mov [storage + rdx*2], ax or something, after getting next into al, and val into ah, probably from separate sources. (This is a case where you might want to use a regular byte load, instead of a movzx, if you don't already have it in another register). This isn't a big deal: don't make your code hard to read or more complex just to avoid doing two byte-stores. At least, not until you find out that store-port uops are the bottleneck in some loop.
Using an index into an array, instead of a pointer, can save a lot of space, esp. on 64bit systems where pointers take 8 bytes. If you don't need to free individual nodes (i.e. data structure only ever grows, or is deleted all at once when it is deleted), then an allocator for new nodes is trivial: just keep sticking them at the end of the array, and realloc(3). Or use a c++ std::vector.
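A sketch of that trivial grow-only allocator, in the same int8_t-index style as the C above (names are illustrative):
#include <stdint.h>

struct node { int8_t next; int8_t val; };
static struct node storage[5];
static uint8_t free_position = 0;

/* hand out the next slot in the array; nodes are never freed
   individually, so there is no free list to manage */
static int8_t node_alloc(void) {
    if (free_position >= sizeof storage / sizeof storage[0])
        return -1;                 /* out of nodes (sentinel, like NULL) */
    return (int8_t)free_position++;
}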
With those building blocks, you should be all set to implement the usual linked list algos. Just store bytes with mov [storage + rdx*2], al or whatever.
If you need ideas on how to implement linked lists with clean algos that handle all the special-cases with as few branches as possible, have a look at this Codereview question. It's for Java, but my answer is very C-style. The other answers have some nice tricks, too, some of which I borrowed for my answer. (e.g. using a dummy node avoids branching to handle the insertion-as-a-new-head special case).

IF statement ASM and CPU branching

Just using the dissembly window in VS2012:
if(p == 7){
00344408 cmp dword ptr [p],7
0034440C jne main+57h (0344417h)
j = 2;
0034440E mov dword ptr [j],2
}
else{
00344415 jmp main+5Eh (034441Eh)
j = 3;
00344417 mov dword ptr [j],3
}
Am I correct in saying a jump table has been implemented? If so, does this still cause CPU branching problems because the assembly still has to execute the cmp command?
I am looking at the performance costs of IF statements and was wondering if the compiler optimizing to a jump-table means no more CPU branching problems.
There is no jump table here: the two jump instructions go to fixed absolute addresses:
jne main+57h (0344417h)
jmp main+5Eh (034441Eh)
There is no indirection. Moreover, using a jump table wouldn't solve the "CPU branching problems" at all: the branch-prediction cost with or without a jump table should be similar.
I wouldn't call that a jump table. A jump table is an array of destination addresses into which the index is computed dynamically from the user data on which you're switching. The code you showed is just a simple control flow with two alternative branches, with entirely statically coded control flow.
As a typical example, if (X) foo() else bar() becomes (in pseudo-code):
jump_if(!X, Label), foo(), jump(End), Label: bar(), End:
The closest way to express a jump table in pure C or C++ is using an array of function pointers.
switch constructs often become jump tables, although unlike the array of function pointers, those are indirect branch within a function instead of indirect call to a new function.
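For reference, here's a sketch of that closest C equivalent, an array of function pointers indexed by the value you would otherwise switch on (all names are made up):
typedef void (*handler_t)(void);

static void on_zero(void) { /* ... */ }
static void on_one(void)  { /* ... */ }

static const handler_t table[] = { on_zero, on_one };

void dispatch(unsigned i) {
    if (i < sizeof table / sizeof table[0])
        table[i]();     /* one indirect call replaces the compare chain */
}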
