Lookup table vs switch in C embedded software

In another thread, I was told that a switch may be better than a lookup table in terms of speed and compactness.
So I'd like to understand the differences between this:
Lookup table
static void func1(){}
static void func2(){}
typedef enum
{
FUNC1,
FUNC2,
FUNC_COUNT
} state_e;
typedef void (*func_t)(void);
const func_t lookUpTable[FUNC_COUNT] =
{
[FUNC1] = &func1,
[FUNC2] = &func2
};
void fsm(state_e state)
{
if (state < FUNC_COUNT)
lookUpTable[state]();
else
;// Error handling
}
and this:
Switch
static void func1(){}
static void func2(){}
void fsm(int state)
{
switch(state)
{
case FUNC1: func1(); break;
case FUNC2: func2(); break;
default: ;// Error handling
}
}
I thought that a lookup table was faster since compilers try to transform switch statements into jump tables when possible.
Since this may be wrong, I'd like to know why!
Thanks for your help!

As I was the author of the original comment, I have to add a very important point you did not mention in your question: the original discussion was about an embedded system. Presuming this is a typical bare-metal system with integrated Flash, there are very important differences from a PC, and I will concentrate on those.
Such embedded systems typically have the following constraints.
no CPU cache.
Flash requires waitstates for higher (i.e. >ca. 32MHz) CPU clocks. The actual ratio depends on the die design, low power/high speed process, operating voltage, etc.
To hide waitstates, Flash has wider read-lines than the CPU-bus.
This only works well for linear code with instruction prefetch.
Data accesses disturb instruction prefetch or are stalled until it has finished.
Flash might have an internal very small instruction cache.
If any at all, there is an even smaller data-cache.
The small caches result in more frequent thrashing (replacing a previous entry before it has been used another time).
For e.g. the STM32F4xx a read takes 6 clocks at 150MHz/3.3V for 128 bits (4 words). So if a data-access is required, chances are good it adds more than 12 clocks delay for all data to be fetched (there are additional cycles involved).
Presuming compact state-codes, for the actual problem, this has the following effects on this architecture (Cortex-M4):
Lookup-table: Reading the function address is a data-access. With all implications mentioned above.
A switch, on the other hand, uses a special "table-lookup" instruction which uses code-space data right behind the instruction. So the first entries are possibly already prefetched. Other entries don't break the prefetch. Also, the access is a code access, thus the data goes into the Flash's instruction cache.
Also note that the switch does not need functions, thus the compiler can fully optimise the code. This is not possible for a lookup table. At least code for function entry/exit is not required.
Due to the aforementioned and other factors, an estimate is hard to give. It heavily depends on your platform and the code structure. But assuming the system given above, the switch is very likely faster (and clearer, btw.).

First, on some processors, indirect calls (e.g. through a pointer) - like those in your Lookup Table example - are costly (pipeline breakage, TLB, cache effects). It might also be true for indirect jumps...
Then, a good optimizing compiler might inline the call to func1() in your Switch example; then you won't run any prologue or epilogue for an inlined function.
You need to benchmark to be sure, since a lot of other factors affect performance. See also this (and the reference there).

Using a LUT of function pointers forces the compiler to use that strategy. It could in theory compile the switch version to essentially the same code as the LUT version (now that you've added out-of-bounds checks to both). In practice, that's not what gcc or clang choose to do, so it's worth looking at the asm output to see what happened.
(update: gcc -fpie (on by default on most modern Linux distros) likes to make tables of relative offsets, instead of absolute function pointers, so the rodata is position-independent, too. GCC Jump Table initialization code generating movsxd and add?. This could be a missed-optimization, see my answer there for links to gcc bug reports. Manually creating an array of function pointers could work around that.)
I put the code on the Godbolt compiler explorer with both functions in one compilation unit (with gcc and clang output), to see how it actually compiled. I expanded the functions a bit so it wasn't just two cases.
void fsm_switch(int state) {
switch(state) {
case FUNC0: func0(); break;
case FUNC1: func1(); break;
case FUNC2: func2(); break;
case FUNC3: func3(); break;
default: ;// Error handling
}
//prevent_tailcall();
}
void fsm_lut(state_e state) {
if (likely(state < FUNC_COUNT)) // without likely(), gcc puts the LUT on the taken side of this branch
lookUpTable[state]();
else
;// Error handling
//prevent_tailcall();
}
See also
How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?
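To try this in a compiler explorer, a minimal self-contained sketch of the two functions could look like the block below. The likely() definition is the usual GCC/clang __builtin_expect wrapper. The calls[] counters are a hypothetical addition so that each handler has an observable effect; because the handlers are then visible and static, a compiler may inline them, so declare them extern without bodies if you want asm closer to what is discussed here.

```c
/* GCC/clang only: tell the compiler the condition is usually true. */
#define likely(x) __builtin_expect(!!(x), 1)

typedef enum { FUNC0, FUNC1, FUNC2, FUNC3, FUNC_COUNT } state_e;

/* Hypothetical counters, present only so every call is observable. */
static int calls[FUNC_COUNT];

static void func0(void) { calls[FUNC0]++; }
static void func1(void) { calls[FUNC1]++; }
static void func2(void) { calls[FUNC2]++; }
static void func3(void) { calls[FUNC3]++; }

typedef void (*func_t)(void);

static const func_t lookUpTable[FUNC_COUNT] = {
    [FUNC0] = func0,
    [FUNC1] = func1,
    [FUNC2] = func2,
    [FUNC3] = func3
};

void fsm_switch(int state)
{
    switch (state) {
    case FUNC0: func0(); break;
    case FUNC1: func1(); break;
    case FUNC2: func2(); break;
    case FUNC3: func3(); break;
    default: break;              /* error handling would go here */
    }
}

void fsm_lut(state_e state)
{
    if (likely(state < FUNC_COUNT))
        lookUpTable[state]();
    /* else: error handling would go here */
}
```

Compiling with -O2 and comparing the two function bodies shows which strategy the compiler picked for the switch on a given target.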
x86
On x86, clang makes its own LUT for the switch, but the entries are pointers to within the function, not the final function pointers. So for clang-3.7, the switch happens to compile to code that is strictly worse than the manually-implemented LUT. Either way, x86 CPUs tend to have branch prediction that can handle indirect calls / jumps, at least if they're easy to predict.
GCC uses a sequence of conditional branches (but unfortunately doesn't tail-call directly with conditional branches, which AFAICT is safe on x86). It checks 1, <1, 2, 3, in that order, with mostly not-taken branches until it finds a match.
They make essentially identical code for the LUT: bounds check, zero the upper 32 bits of the arg register with a mov, and then a memory-indirect jump with an indexed addressing mode.
ARM:
gcc 4.8.2 with -mcpu=cortex-m4 -O2 makes interesting code.
As Olaf said, it makes an inline table of 1B entries. It doesn't jump directly to the target function, but instead to a normal jump instruction (like b func3). This is a normal unconditional jump, since it's a tail-call.
Each table destination entry needs significantly more code (Godbolt) if fsm_switch does anything after the call (like in this case a non-inline function call, if void prevent_tailcall(void); is declared but not defined), or if this is inlined into a larger function.
## With void prevent_tailcall(void){} defined so it can inline:
## Unlike in the godbolt link, this is doing tailcalls.
fsm_switch:
cmp r0, #3 # state,
bhi .L5 #
tbb [pc, r0] # state
## There's no section .rodata directive here: the table is in-line with the code, so there's no need for base pointer to be loaded into a reg. And apparently it's even loaded from I-cache, not D-cache
.byte (.L7-.L8)/2
.byte (.L9-.L8)/2
.byte (.L10-.L8)/2
.byte (.L11-.L8)/2
.L11:
b func3 # optimized tail-call
.L10:
b func2
.L9:
b func1
.L7:
b func0
.L5:
bx lr # This is ARM's equivalent of an x86 ret insn
IDK if there's much difference between how well branch prediction works for tbb vs. a full-on indirect jump or call (blx), on a lightweight ARM core. A data access to load the table might be more significant than the two-step jump to a branch instruction you get with a switch.
I've read that indirect branches are poorly predicted on ARM. I'd hope it's not bad if the indirect branch has the same target every time. But if not, I'd assume most ARM cores won't find even short patterns the way big x86 cores will.
Instruction fetch/decode takes longer on x86, so it's more important to avoid bubbles in the instruction stream. This is one reason why x86 CPUs have such good branch prediction. Modern branch predictors even do a good job with patterns for indirect branches, based on history of that branch and/or other branches leading up to it.
The LUT function has to spend a couple instructions loading the base address of the LUT into a register, but otherwise is pretty much like x86:
fsm_lut:
cmp r0, #3 # state,
bhi .L13 #,
movw r3, #:lower16:.LANCHOR0 # tmp112,
movt r3, #:upper16:.LANCHOR0 # tmp112,
ldr r3, [r3, r0, lsl #2] # tmp113, lookUpTable
bx r3 # indirect register sibling call # tmp113
.L13:
bx lr #
# in the .rodata section
lookUpTable:
.word func0
.word func1
.word func2
.word func3
See Mike of SST's answer for a similar analysis on a Microchip dsPIC.

msc's answer and the comments give you good hints as to why performance may not be what you expect. Benchmarking is the rule, but results will vary from one architecture to another, and may change with other versions of the compiler and of course its configuration and options selected.
Note however that your 2 pieces of code do not perform the same validation on state:
The switch will gracefully do nothing if state is not one of the defined values,
The jump table version will invoke undefined behavior for all but the 2 values FUNC1 and FUNC2.
There is no generic way to initialize the jump table with dummy function pointers without making assumptions on FUNC_COUNT. To get the same behavior, the jump table version should look like this:
void fsm(int state) {
if (state >= 0 && state < FUNC_COUNT && lookUpTable[state] != NULL)
lookUpTable[state]();
}
Try benchmarking this and inspect the assembly code. Here is a handy online compiler for this: http://gcc.godbolt.org/#
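As a minimal sketch of why the NULL test in the check above matters: with designated initializers, any entry that is not listed is implicitly a null pointer. The FUNC_RESERVED enumerator, the hit counters and the int return code below are hypothetical additions for illustration, not part of the question's code.

```c
#include <stddef.h>

/* FUNC_RESERVED is a hypothetical state that has no handler yet. */
typedef enum { FUNC1, FUNC2, FUNC_RESERVED, FUNC_COUNT } state_e;

static int hits1, hits2;                 /* illustration only */
static void func1(void) { hits1++; }
static void func2(void) { hits2++; }

typedef void (*func_t)(void);

static const func_t lookUpTable[FUNC_COUNT] = {
    [FUNC1] = func1,
    [FUNC2] = func2
    /* [FUNC_RESERVED] is not listed, so it is implicitly NULL */
};

/* Returns 0 if a handler ran, -1 if the state was rejected. */
int fsm_checked(int state)
{
    if (state < 0 || state >= FUNC_COUNT || lookUpTable[state] == NULL)
        return -1;
    lookUpTable[state]();
    return 0;
}
```

The range test guards against indexing outside the array; the NULL test guards against holes inside it.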

On the Microchip dsPIC family of devices a look-up table is stored as a set of instruction addresses in the Flash itself. Performing the look-up involves reading the address from the Flash then calling the routine. Making the call adds another handful of cycles to push the instruction pointer and other bits and bobs (e.g. setting the stack frame) of housekeeping.
For example, on the dsPIC33E512MU810, using XC16 (v1.24) the look-up code:
lookUpTable[state]();
Compiles to (from the disassembly window in MPLAB-X):
! lookUpTable[state]();
0x2D20: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D22: ADD W4, W4, W5 ; 1 cycle (addresses are 16 bit aligned)
0x2D24: MOV #0xA238, W4 ; 1 cycle (get base address of look-up table)
0x2D26: ADD W5, W4, W4 ; 1 cycle (get address of entry in table)
0x2D28: MOV [W4], W4 ; 1 cycle (get address of the function)
0x2D2A: CALL W4 ; 2 cycles (push PC+2 set PC=W4)
... and each (empty, do-nothing) function compiles to:
!static void func1()
!{}
0x2D0A: LNK #0x0 ; 1 cycle (set up stack frame)
! Function body goes here
0x2D0C: ULNK ; 1 cycle (un-link frame pointer)
0x2D0E: RETURN ; 3 cycles
This is a total of 11 instruction cycles of overhead for any of the cases, and they all take the same. (Note: If either the table or the functions it contains are not in the same 32K program word Flash page, there will be an even greater overhead due to having to get the Address Generation Unit to read from the correct page, or to set up the PC to make a long call.)
On the other hand, providing that the whole switch statement fits within a certain size, the compiler will generate code that does a test and relative branch as two instructions per case taking three (or possibly four) cycles per case up to the one that's true.
For example, the switch statement:
switch(state)
{
case FUNC1: state++; break;
case FUNC2: state--; break;
default: break;
}
Compiles to:
! switch(state)
0x2D2C: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D2E: SUB W4, #0x0, [W15] ; 1 cycle (compare with first case)
0x2D30: BRA Z, 0x2D38 ; 1 cycle (if branch not taken, or 2 if it is)
0x2D32: SUB W4, #0x1, [W15] ; 1 cycle (compare with second case)
0x2D34: BRA Z, 0x2D3C ; 1 cycle (if branch not taken, or 2 if it is)
! {
! case FUNC1: state++; break;
0x2D38: INC [W14], [W14] ; To stop the switch being optimised out
0x2D3A: BRA 0x2D40 ; 2 cycles (go to end of switch)
! case FUNC2: state--; break;
0x2D3C: DEC [W14], [W14] ; To stop the switch being optimised out
0x2D3E: NOP ; compiler did a fall-through (for some reason)
! default: break;
0x2D36: BRA 0x2D40 ; 2 cycles (go to end of switch)
! }
This is an overhead of 5 cycles if the first case is taken, 7 if the second case is taken, etc., meaning they break even on the fourth case.
This means that knowing your data at design time will have a significant influence on the long-term speed. If you have a significant number (more than about 4 cases) and they all occur with similar frequency then a look-up table will be quicker in the long run. If the frequency of the cases is significantly different (e.g. case 1 is more likely than case 2, which is more likely than case 3, etc.) then, if you order the switch with the most likely case first, then the switch will be faster in the long run. For the edge case when you only have a few cases the switch will (probably) be faster anyway for most executions and is more readable and less error prone.
If there are only a few cases in the switch, or some cases will occur more often than others, then doing the test and branch of the switch will probably take fewer cycles than using a look-up table. On the other hand, if you have more than a handful of cases that occur with similar frequency then the look-up will probably end up being faster on average.
Tip: Go with the switch unless you know the look-up will definitely be faster and the time it takes to run is important.
Edit: My switch example is a little unfair, as I've ignored the original question and in-lined the 'body' of the cases to highlight the real advantage of using a switch over a look-up. If the switch has to do the call as well then it only has the advantage for the first case!
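The break-even reasoning can be written down as a small cost model. This is only a sketch using the cycle counts quoted above for the dsPIC33E (11 cycles for the look-up, 5 and 7 cycles for the first two switch cases with in-lined bodies); extending the pattern to 3 + 2k cycles for the k-th case is an assumption, and the function names are made up for illustration.

```c
/* Overhead of the look-up table path, identical for every case. */
enum { LUT_CYCLES = 11 };

/* Assumed switch overhead when the k-th case (1-based) matches: 5, 7, 9, ... */
static unsigned switch_cycles(unsigned k)
{
    return 3u + 2u * k;
}

/* Average switch overhead; weight[i] is the relative frequency of case i+1. */
double expected_switch_cycles(const unsigned *weight, unsigned n)
{
    double total = 0.0, sum = 0.0;
    for (unsigned i = 0; i < n; i++) {
        total += (double)weight[i] * switch_cycles(i + 1);
        sum += weight[i];
    }
    return sum > 0.0 ? total / sum : 0.0;
}
```

With eight equally likely cases the model gives an average of 12 cycles for the switch, slightly worse than the table; with weights 8, 4, 2, 1 over four cases it gives about 6.5 cycles, well below the table's 11.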

To have even more compiler outputs, here is what the TI C28x compiler produces for @PeterCordes's sample code:
_fsm_switch:
CMPB AL,#0 ; [CPU_] |62|
BF $C$L3,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#1 ; [CPU_] |62|
BF $C$L2,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#2 ; [CPU_] |62|
BF $C$L1,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#3 ; [CPU_] |62|
BF $C$L4,NEQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
LCR #_func3 ; [CPU_] |66|
; call occurs [#_func3] ; [] |66|
B $C$L4,UNC ; [CPU_] |66|
; branch occurs ; [] |66|
$C$L1:
LCR #_func2 ; [CPU_] |65|
; call occurs [#_func2] ; [] |65|
B $C$L4,UNC ; [CPU_] |65|
; branch occurs ; [] |65|
$C$L2:
LCR #_func1 ; [CPU_] |64|
; call occurs [#_func1] ; [] |64|
B $C$L4,UNC ; [CPU_] |64|
; branch occurs ; [] |64|
$C$L3:
LCR #_func0 ; [CPU_] |63|
; call occurs [#_func0] ; [] |63|
$C$L4:
LCR #_prevent_tailcall ; [CPU_] |69|
; call occurs [#_prevent_tailcall] ; [] |69|
LRETR ; [CPU_]
; return occurs ; []
_fsm_lut:
;* AL assigned to _state
CMPB AL,#4 ; [CPU_] |84|
BF $C$L5,HIS ; [CPU_] |84|
; branchcc occurs ; [] |84|
CLRC SXM ; [CPU_]
MOVL XAR4,#_lookUpTable ; [CPU_U] |85|
MOV ACC,AL << 1 ; [CPU_] |85|
ADDL XAR4,ACC ; [CPU_] |85|
MOVL XAR7,*+XAR4[0] ; [CPU_] |85|
LCR *XAR7 ; [CPU_] |85|
; call occurs [XAR7] ; [] |85|
$C$L5:
LCR #_prevent_tailcall ; [CPU_] |88|
; call occurs [#_prevent_tailcall] ; [] |88|
LRETR ; [CPU_]
; return occurs ; []
I also used -O2 optimizations.
We can see that the switch is not converted into a jump table even if the compiler has the ability.

Related

Is it possible to realize subroutine without indirect addressing?

I am working on the Simple-Compiler project in Deitel's book C how to program. Its main goal is to generate a compiler for an advanced language called SIMPLE and the relevant machine language is called SIMPLETRON.
I've completed some basic features for this compiler but am now stuck with an enhanced requirement -- to realize gosub and return (subroutine features) for SIMPLE language.
The main obstacle here is that SIMPLETRON doesn't support indirect addressing, which means the strategy to use stack for returning addresses of subroutines can't work. In this case, is it possible to somehow make subroutines work?
PS: I searched this issue and found a relevant question here. It seemed self-modifying code might be the answer, but I failed to find specific resolutions, so I am still raising this question. Moreover, in my opinion the machine instructions for SIMPLETRON have to be extended to make self-modifying code work here, right?
Background information for SIMPLETRON machine language:
It includes only one accumulator as register.
All supported machine instructions as below:
Input/output operations
#define READ 10: Read a word from the terminal into memory and with an operand as the memory address.
#define WRITE 11: Write a word from memory to the terminal and with an operand as the memory address.
Load/store operations
#define LOAD 20: Load a word from memory into the accumulator and with an operand as the memory address.
#define STORE 21: Store a word from the accumulator into memory and with an operand as the memory address.
Arithmetic operations
#define ADD 30: Add a word from memory to the word in the accumulator (leave result in accumulator) and with an operand as the memory address.
#define SUBTRACT 31: Subtract a word ...
#define DIVIDE 32: Divide a word ...
#define MULTIPLY 33: Multiply a word ...
Transfer of control operations
#define BRANCH 40: Branch and with an operand as the code location.
#define BRANCHNEG 41: Branch if the accumulator is negative and with an operand as the code location.
#define BRANCHZERO 42: Branch if the accumulator is zero and with an operand as the code location.
#define HALT 43: End the program. No operand.
I'm not familiar with SIMPLE or SIMPLETRON, but in general I can think of at least 3 approaches.
Self-modifying code
Have a BRANCH 0 instruction at the end of each subroutine, and before that, code to load the return address into the accumulator and STORE it into the code itself, thus effectively forming a BRANCH <dynamic> instruction.
Static list of potential callers
If SIMPLE doesn't have indirect calls (i.e. every gosub targets a statically known subroutine), then the compiler knows the list of possible callers of each subroutine. Then it could have each call pass a unique argument (e.g. in the accumulator), which the subroutine can test (pseudocode):
SUBROUTINE:
...
if (arg == 0)
branch CALLER_1;
if (arg == 1)
branch CALLER_2;
if (arg == 2)
branch CALLER_3;
Inlining
If SIMPLE doesn't allow recursive subroutines, there's no need to implement calls at the machine code level at all. Simply inline every subroutine into its caller completely.
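The "static list of potential callers" idea can be simulated in plain C with nothing but direct gotos and equality tests. This is a sketch only: the labels, the add_ten subroutine and the return_id variable are invented for illustration, and acc merely stands in for the SIMPLETRON accumulator.

```c
/* Simulates gosub/return with direct jumps only: each call site stores a
   unique ID, and the subroutine's exit compares that ID to find its caller. */
int run_demo(void)
{
    int acc = 0;        /* stands in for the accumulator */
    int return_id;      /* written by each call site before jumping */

    return_id = 1;      /* first "gosub add_ten" */
    goto add_ten;
caller_1:
    acc *= 2;
    return_id = 2;      /* second "gosub add_ten" */
    goto add_ten;
caller_2:
    return acc;

add_ten:                /* subroutine body */
    acc += 10;
    /* "return": a chain of compares, one per known call site */
    if (return_id == 1) goto caller_1;
    if (return_id == 2) goto caller_2;
    return -1;          /* unknown caller */
}
```

Each comparison maps onto a SUBTRACT followed by BRANCHZERO, so no instruction ever needs a computed target.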
Yes, you can do this, even reasonably, without self-modifying code.
You turn your return addresses into a giant case statement.
The secret is understanding that a "return address" is just a way
to get back to point of the call, and that memory is just a giant
array of named locations.
Imagine I have a program with many logical call locations, with the instruction
after the call labelled:
CALL S
$1: ...
...
CALL T
$2: ...
...
CALL U
$3: ...
We need to replace the CALLs with something our machine can implement.
Let's also assume temporarily that only one subroutine call is active at any moment.
Then all that matters, is that after a subroutine completes, that control
returns to the point after the call.
You can cause this by writing the following SIMPLETRON code (I'm making up the syntax). By convention I assume I have a bunch of memory locations K1, K2, ... that contain the constants 1, 2, ... etc. for as many constants as I need.
K1: 1
K2: 2
K3: 3
...
LOAD K1
JMP S
$1: ...
...
LOAD K2
JMP T
$2: ...
...
LOAD K3
JMP U
$3:....
S: STORE RETURNID
...
JMP RETURN
T: STORE RETURNID
...
JMP RETURN
U: STORE RETURNID
...
JMP RETURN
RETURN: LOAD RETURNID
SUB K1
JE $1
LOAD RETURNID
SUB K2
JE $2
LOAD RETURNID
SUB K3
JE $3
JMP * ; bad return address, just hang
In essence, each call site records a constant (RETURNID) unique to that call site, and "RETURN" logic uses that unique ID to figure out the return point. If you have a lot of subroutines, the return logic code might be quite long, but hey, this is a toy machine and we aren't that interested in efficiency.
You could always make the return logic into a binary decision tree; then
the code might be long, but it would only take log2(callcount) comparisons to decide how to get back (not actually all that bad).
Let's relax our assumption of only one subroutine active at any moment.
You can define for each subroutine a RETURNID, but still use the same RETURN code. With this idea, any subroutine can call any other subroutine. Obviously these routines are not-reentrant, so they can't be called more than once in any call chain.
We can use this same idea to implement a return stack. The trick is to recognize that a stack is merely a set of memory locations with an address decoder that picks out members of the stack. So, let's implement
PUSH and POP instructions as subroutines. We change our calling convention
to make the caller record the RETURNID, leaving the accumulator free
to pass a value:
LOAD K1
STORE PUSHRETURNID
LOAD valuetopush
JMP PUSH
$1:
LOAD K2
STORE POPRETURNID
JMP POP
$2:...
TEMP:
STACKINDEX: 0 ; incremented to 1 on first use
STACK1: 0 ; 1st stack location
...
STACKN: 0
PUSH: STORE TEMP ; save value to push
LOAD PUSHRETURNID ; do this here once instead of in every exit
STORE RETURNID
LOAD STACKINDEX ; add 1 to SP here, once, instead of in every exit
ADD K1
STORE STACKINDEX
SUB K1
JE STORETEMPSTACK1
LOAD STACKINDEX
SUB K2
JE STORETEMPSTACK2
...
LOAD STACKINDEX
SUB Kn
JE STORETEMPSTACKn
JMP * ; stack overflow
STORETEMPSTACK1:
LOAD TEMP
STORE STACK1
JMP RETURN
STORETEMPSTACK2:
LOAD TEMP
STORE STACK2
JMP RETURN
...
POP: LOAD STACKINDEX
SUB K1 ; decrement SP here once, rather than in every exit
STORE STACKINDEX
LOAD STACKINDEX
SUB K0
JE LOADSTACK1
LOAD STACKINDEX
SUB K1
JE LOADSTACK2
...
LOADSTACKn:
LOAD STACKn
JMP RETURNFROMPOP
LOADSTACK1:
LOAD STACK1
JMP RETURNFROMPOP
LOADSTACK2:
LOAD STACK2
JMP RETURNFROMPOP
RETURNFROMPOP: STORE TEMP
LOAD POPRETURNID
SUB K1
JE RETURNFROMPOP1
LOAD POPRETURNID
SUB K2
JE RETURNFROMPOP2
...
RETURNFROMPOP1: LOAD TEMP
JMP $1
RETURNFROMPOP2: LOAD TEMP
JMP $2
Note that we need RETURN, to handle returns with no value, and RETURNFROMPOP, that handles returns from the POP subroutine with a value.
So these look pretty clumsy, but we can now realize a pushdown stack
of fixed but arbitrarily large depth. If we again make binary decision trees out of the stack location and returnID checking, the runtime costs are only logarithmic in the size of the stacks/call count, which is actually pretty good.
OK, now we have general PUSH and POP subroutines. Now we can make calls that store the return address on the stack:
LOAD K1 ; indicate return point
STORE PUSHRETURNID
LOAD K2 ; call stack return point
JMP PUSH
$1: LOAD argument ; a value to pass to the subroutine
JMP RECURSIVESUBROUTINEX
; returns here with subroutine result in accumulator
$2:
RECURSIVESUBROUTINEX:
...compute on accumulator...
LOAD K3 ; indicate return point
STORE PUSHRETURNID
LOAD K4 ; call stack return point
JMP PUSH
$3: LOAD ... ; some revised argument
JMP RECURSIVESUBROUTINEX
$4: ; return here with accumulator containing result
STORE RECURSIVESUBROUTINERESULT
LOAD K5
STORE POPRETURNID
JMP POP
$5: ; accumulator contains return ID
STORE POPRETURNID
LOAD RECURSIVESUBROUTINERESULT
JMP RETURNFROMPOP
That's it. Now you have fully recursive subroutine calls with a stack, with no (well, faked) indirection.
I wouldn't want to program this machine manually because building the RETURN routines would be a royal headache to code and keep right. But a compiler would be perfectly happy to manufacture all this stuff.
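To make the "address decoder" idea concrete, here is a minimal C sketch of a three-deep stack built from individually named cells, with the slot chosen by comparing the index against constants instead of indexing an array. The depth, the names and the error codes are assumptions for illustration; a compiler targeting SIMPLETRON would emit the switch as a chain of SUBTRACT/BRANCHZERO tests.

```c
/* Fixed-depth stack made of separately named cells (STACK1..STACK3). */
static int stack1, stack2, stack3;
static int stack_index;          /* number of occupied cells */
static int popped;               /* plays the role of TEMP after a pop */

/* Returns 0 on success, -1 on overflow. */
int push(int value)
{
    switch (stack_index) {       /* decode the slot by comparison */
    case 0: stack1 = value; break;
    case 1: stack2 = value; break;
    case 2: stack3 = value; break;
    default: return -1;          /* stack overflow */
    }
    stack_index++;
    return 0;
}

/* Returns 0 and leaves the value in `popped`, or -1 on underflow. */
int pop(void)
{
    switch (stack_index) {
    case 1: popped = stack1; break;
    case 2: popped = stack2; break;
    case 3: popped = stack3; break;
    default: return -1;          /* stack underflow */
    }
    stack_index--;
    return 0;
}
```

Adding depth means adding one cell and one case per level, which is tedious by hand but trivial for a code generator.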
Although there's no way to get the current instruction's location from within the SIMPLE instruction set, the assembler can keep track of instruction locations in order to generate the equivalent of return instructions.
The assembler would generate a branch to address instruction in the program image to be used as a return instruction, then to implement a call it would generate code to load a "return instruction" and store it at the end of a subroutine before branching to that subroutine. Each instance of a "call" would require an instance of a "return instruction" in the program image. You may want to reserve a range of variable memory to store these "return instructions".
Example "code" using a call that includes the label of the return instruction as a parameter:
call sub1, sub1r
; ...
sub1: ; ...
sub1r: b 0 ;this is the return instruction
Another option would be something similar to MASM PROC and ENDP, where the ENDP would hold the return instruction. The call directive would assume that the endp directive holds the branch to be modified and the label would be the same as the corresponding proc directive.
call sub1
; ...
sub1 proc ;subroutine entry point
; ...
sub1 endp ;subroutine end point, "return" stored here
The issue here is that the accumulator would be destroyed by the "call" (but not affected by the "return"). If needed, subroutine parameters could be stored as variables, perhaps using assembler directives for labeling:
sub1 parm1 ;parameter 1 for sub1
;....
load sub1.parm1 ;load sub1 parameter 1
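The "store a return instruction at the end of the subroutine" scheme can be tried out on a tiny interpreter. The sketch below assumes the usual textbook encoding of a SIMPLETRON word as opcode * 100 + operand and a 100-word memory shared by code and data; neither is stated in the question, and the memory layout of the demo program is invented. Under those assumptions, plain LOAD and STORE are enough to patch the branch, with no new instructions.

```c
enum { LOAD = 20, STORE = 21, ADD = 30, BRANCH = 40, HALT = 43 };
#define MEM_SIZE 100

/* Runs until HALT and returns the accumulator; -1 on a bad opcode,
   a bad program counter, or after too many steps. */
int run(int mem[MEM_SIZE])
{
    int acc = 0, pc = 0;
    for (int steps = 0; steps < 1000; steps++) {
        if (pc < 0 || pc >= MEM_SIZE) return -1;
        int op = mem[pc] / 100, arg = mem[pc] % 100;
        pc++;
        switch (op) {
        case LOAD:   acc = mem[arg];  break;
        case STORE:  mem[arg] = acc;  break;
        case ADD:    acc += mem[arg]; break;
        case BRANCH: pc = arg;        break;
        case HALT:   return acc;
        default:     return -1;
        }
    }
    return -1;
}

/* Calls a subroutine at address 10 twice; it adds 5 to a counter at 30. */
int demo(void)
{
    int mem[MEM_SIZE] = {0};
    /* main program */
    mem[0] = LOAD * 100 + 20;    /* acc = "BRANCH 03" word */
    mem[1] = STORE * 100 + 13;   /* patch the subroutine's last slot */
    mem[2] = BRANCH * 100 + 10;  /* first call */
    mem[3] = LOAD * 100 + 21;    /* acc = "BRANCH 06" word */
    mem[4] = STORE * 100 + 13;
    mem[5] = BRANCH * 100 + 10;  /* second call */
    mem[6] = LOAD * 100 + 30;    /* fetch the counter */
    mem[7] = HALT * 100;
    /* subroutine */
    mem[10] = LOAD * 100 + 30;
    mem[11] = ADD * 100 + 31;
    mem[12] = STORE * 100 + 30;
    mem[13] = BRANCH * 100 + 0;  /* placeholder, overwritten by each caller */
    /* data: one prebuilt "return instruction" per call site */
    mem[20] = BRANCH * 100 + 3;
    mem[21] = BRANCH * 100 + 6;
    mem[31] = 5;
    return run(mem);
}
```

Note that the accumulator is clobbered by the call sequence, which matches the caveat above about passing parameters through variables instead.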

Precise delays on Arduino using nop assembly?

I'm looking to make a very short pulse after a rising edge signal input.
The hard part here is that I would like to control (to high resolution) the timing of the delay before my pulse, and the duration of my pulse. I can easily control this by just stringing together nops by myself, hard coding delays, but I'm not sure how to do it for some arbitrary delay, with the same level of accuracy.
After a lot of headaches chasing down timers, and then eventually realizing I am ultimately limited by the interrupt routine entry/exit time, I am now settling at trying to control my delay via nops.
I had assumed this C switch statement would be what I wanted (after compiling, hoping it would become efficient and just change the program counter to the right spot), but it produces some very odd behavior...
switch(delayTime){
case 10:
__asm__ __volatile__("nop");
case 9:
__asm__ __volatile__("nop");
case 8:
__asm__ __volatile__("nop");
case 7:
__asm__ __volatile__("nop");
case 6:
__asm__ __volatile__("nop");
case 5:
__asm__ __volatile__("nop");
case 4:
__asm__ __volatile__("nop");
case 3:
__asm__ __volatile__("nop");
case 2:
__asm__ __volatile__("nop");
case 1:
__asm__ __volatile__("nop");
}
PORTD = 0x10;
...
Ideally, I would like to essentially run through some code that would compile into this: (it's some weird pseudocode of C and assembly, still not sure how to do some of it in assembly)
0x005 Reg1 = 0xFF-val1 %(where somehow 0xFF is known? / found out?)
0x006 Reg2 =0x1FF-val2
0x007 IJMP Reg1
0x008 NOP
0x009 NOP
0x00A NOP
...
0x0FF MOV 0x40, PORTD % assign the value 0x40 to the static variable "PORTD"
0x100 IJMP Reg2
0x101 NOP
0x102 NOP
0x103 NOP
0x104 NOP
...
0x1FF MOV 0x00, PORTD % assign the value 0x00 to the static variable "PORTD"
I'm just overall not sure how to find the memory location for the code after/during run time so that the "0xFF" and "0x1FF" aspects of this program are not really so bad (it seems like it's super dangerous to just, get the assembly of the code, and then hard code that in... I'd rather not do that). Also, while it's easy to just flood it with the 200+ nops, how to get the IJMP cmd to behave the way I want it to? (I honestly don't even know if that's the command I want)..
I guess in general I'm looking for some assembly command (that I can't seem to find) that allows me to "add N to Program Counter" and I can just make sure that that command is run in assembly with at least N+1 commands of assembly ahead of it, hardcoded in.
As a side note, all of this is executing inside of an interrupt routine, so I don't feel so bad about playing around with the PC... Also, I know it's kinda bad blocking for up to 500 operations, but for the task at hand, timing is more important than how badly it blocks as a routine.
I'm not familiar with the AVR instruction set, but the general idea is to use the CALL instruction to put the program counter (PC) on the stack. Then use POP to move the PC to the Z register. Then you can ADD some number to the Z register, and use IJMP to jump to the resulting address.
So something along these lines
delay: call delay1 ; push the PC onto the stack
delay1: pop r30 ; pop the PC into the Z registers
pop r31
add r30,r0 ; add some amount to the PC value
adc r31,r1
ijmp ; use IJMP to jump to the resulting address
nop
nop
nop
...
Random thoughts:
On the 8MB machines, you need a third pop to remove the third byte of
the PC from the stack.
Z is only sixteen bits, therefore this code must be in the first
128KB of program memory.
I'm not sure which register (r30 or r31) is supposed to be popped
first.
The value added to Z must be relative to delay1 since call is
going to push the address of delay1 onto the stack. In other words,
the minimum amount that needs to be added is 5, since that's the
number of instruction words from delay1 to the first nop.
The minimum delay is determined by the six instructions up to and
including the ijmp. You should increase r1/r0 (reduce the number of
nops) accordingly.
Like I said, I'm no expert on the AVR instruction set, so you should take this as a general suggestion, and be prepared to spend some time working out the particulars. Good luck!
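The offset arithmetic for jumping into a run of nops can be sketched in C. This is only the bookkeeping, not AVR code: sled_offset and its parameters are hypothetical names, and preamble_words (the number of instruction words between delay1 and the first nop) is left as a parameter to be read off the actual disassembly.

```c
/* Word offset to add to the address of delay1 so that exactly
   nops_wanted of the sled_len NOPs execute after the jump.
   Returns -1 if the request does not fit the sled. */
int sled_offset(int sled_len, int nops_wanted, int preamble_words)
{
    if (sled_len < 0 || nops_wanted < 0 || nops_wanted > sled_len)
        return -1;
    /* Skip the preamble, then skip the NOPs we do not want to run. */
    return preamble_words + (sled_len - nops_wanted);
}
```

Asking for the full sled gives the smallest offset (just past the preamble), and asking for zero nops lands on the first instruction after the sled.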

IF statement ASM and CPU branching

Just using the disassembly window in VS2012:
if(p == 7){
00344408 cmp dword ptr [p],7
0034440C jne main+57h (0344417h)
j = 2;
0034440E mov dword ptr [j],2
}
else{
00344415 jmp main+5Eh (034441Eh)
j = 3;
00344417 mov dword ptr [j],3
}
Am I correct in saying a jump table has been implemented? If so, does this still cause CPU branching problems because the assembly still has to execute the cmp command?
I am looking at the performance costs of IF statements and was wondering if the compiler optimizing to a jump-table means no more CPU branching problems.
There is no jump table here: the two jump instructions each target a fixed address:
jne main+57h (0344417h)
jmp main+5Eh (034441Eh)
There is no indirection. Using a jump table doesn't solve at all the "CPU branching problems". The branch prediction cost with or without jump table should be similar.
I wouldn't call that a jump table. A jump table is an array of destination addresses into which the index is computed dynamically from the user data on which you're switching. The code you showed is just a simple control flow with two alternative branches, with entirely statically coded control flow.
As a typical example, if (X) foo() else bar() becomes (in pseudo-code):
jump_if(!X, Label), foo(), jump(End), Label: bar(), End:
The closest way to express a jump table in pure C or C++ is using an array of function pointers.
switch constructs often become jump tables, although unlike the array of function pointers, those are indirect branches within a function instead of indirect calls to a new function.
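A minimal sketch of the difference, with made-up handler names: the first function indexes an array of function pointers with a value computed at run time (an indirect call, i.e. a hand-written jump table), while the second uses only compare-and-branch with every destination fixed in the code, like the if/else in the question.

```c
typedef int (*handler_t)(int);

static int twice(int x)  { return 2 * x; }
static int negate(int x) { return -x; }
static int square(int x) { return x * x; }

static const handler_t handlers[] = { twice, negate, square };

/* Indirect call: the index, known only at run time, selects the target. */
int dispatch_table(unsigned op, int x)
{
    const unsigned count = sizeof handlers / sizeof handlers[0];
    return op < count ? handlers[op](x) : x;
}

/* Direct branches only: each destination is hard-coded. */
int dispatch_if(unsigned op, int x)
{
    if (op == 0)
        return twice(x);
    else if (op == 1)
        return negate(x);
    else if (op == 2)
        return square(x);
    return x;
}
```

Both versions still contain a branch the CPU has to predict; the table version just replaces several conditional branches with one bounds check and one indirect one.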

possible to do if (!boolvar) { ... in 1 asm instruction?

This question is more out of curiosity than necessity:
Is it possible to rewrite the c code if ( !boolvar ) { ... in a way so it is compiled to 1 cpu instruction?
I've tried thinking about this on a theoretical level and this is what I've come up with:
if ( !boolvar ) { ...
would need to first negate the variable and then branch depending on that -> 2 instructions (negate + branch)
if ( boolvar == false ) { ...
would need to load the value of false into a register and then branch depending on that -> 2 instructions (load + branch)
if ( boolvar != true ) { ...
would need to load the value of true into a register and then branch ("branch-if-not-equal") depending on that -> 2 instructions (load + "branch-if-not-equal")
Am I wrong with my assumptions? Is there something I'm overlooking?
I know I can produce intermediate asm versions of programs, but I wouldn't know how to use this in a way so I can on one hand turn on compiler optimization and at the same time not have an empty if statement optimized away (or have the if statement optimized together with its content, giving some non-generic answer)
P.S.: Of course I also searched google and SO for this, but with such short search terms I couldn't really find anything useful
P.P.S.: I'd be fine with a semantically equivalent version which is not syntactical equivalent, e.g. not using if.
Edit: feel free to correct me if my assumptions about the emitted asm instructions are wrong.
Edit2: I've actually learned asm about 15yrs ago, and relearned it about 5yrs ago for the alpha architecture, but I hope my question is still clear enough to figure out what I'm asking. Also, you're free to assume any kind of processor extension common in consumer cpus up to AVX2 (current haswell cpu as of the time of writing this) if it helps in finding a good answer.
At the end of my post it will say why you should not aim for this behaviour (on x86).
As Jerry Coffin has written, most jumps in x86 depend on the flags register.
There is one exception though: The j*cxz set of instructions which jump if the ecx/rcx register is zero. To achieve this you need to make sure that your boolvar uses the ecx register. You can achieve that by specifically assigning it to that register
register int boolvar asm ("ecx");
But far from all compilers use the j*cxz set of instructions. There is a flag for icc to make it do that, but it is generally not advisable. The Intel manual states that the two instructions
test ecx, ecx
jz ...
are faster on the processor.
The reason is that x86 is a CISC (complex) instruction set. In the actual hardware, though, the processor splits complex instructions that appear as one instruction in the asm into multiple microinstructions, which are then executed in a RISC style. This is why not all instructions require the same execution time, and why multiple small ones are sometimes faster than one big one.
test and jz are single microinstructions, but jecxz will be decomposed into those two anyways.
The only reason why the j*cxz set of instructions exist is if you want to make a conditional jump without modifying the flags register.
Yes, it's possible -- but doing so will depend on the context in which this code takes place.
Conditional branches in an x86 depend upon the values in the flags register. For this to compile down to a single instruction, some other code will already need to set the correct flag, so all that's left is a single instruction like jnz wherever.
For example:
boolvar = x == y;
if (!boolvar) {
do_something();
}
...could end up rendered as something like:
mov eax, x
cmp eax, y ; `boolvar = x == y;`
jz @f
call do_something
@@:
Depending on your viewpoint, it could even compile down to only part of an instruction. For example, quite a few instructions can be "predicated", so they're executed only if some previously defined condition is true. In this case, you might have one instruction for setting "boolvar" to the correct value, followed by one to conditionally call a function, so there's no one (complete) instruction that corresponds to the if statement itself.
Although you're unlikely to see it in decently written C, a single assembly language instruction could include even more than that. For an obvious example, consider something like:
x = 10;
looptop:
-- x;
boolvar = x == 0;
if (!boolvar)
goto looptop;
This entire sequence could be compiled down to something like:
mov ecx, 10
looptop:
loop looptop
Am I wrong with my assumptions
You are wrong with several assumptions. First, you should know that 1 instruction is not necessarily faster than multiple ones. For example, in newer μarchs test can macro-fuse with jcc, so the 2 instructions run as one. Conversely, a division is so slow that tens or hundreds of simpler instructions may already have finished in the same time. Compiling the if block to a single instruction isn't worth it if it's slower than multiple instructions.
Besides, if ( !boolvar ) { ... doesn't need to first negate the variable and then branch depending on that. Most jumps in x86 are based on flags, and they have both the yes and no conditions, so no need to negate the value. We can simply jump on non-zero instead of jump on zero
Similarly, if ( boolvar == false ) { ... doesn't need to load the value of false into a register and then branch depending on that. false is a constant equal to 0, which can be embedded as an immediate in the instruction (like cmp reg, 0). For checking against zero, though, a simple test reg, reg is enough. Then jz or jnz is used to jump on zero/non-zero, and it is fused with the preceding test instruction into one.
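To make that concrete, here is a minimal sketch showing that the three spellings from the question mean exactly the same thing for a bool, which is why a compiler is free to emit the same test/branch pair for each of them. The function names are made up for illustration; each returns 1 when the if body would run.

```c
#include <stdbool.h>

/* Hypothetical helpers, one per spelling used in the question. */
int taken_not(bool boolvar)
{
    if (!boolvar)
        return 1;
    return 0;
}

int taken_eq_false(bool boolvar)
{
    if (boolvar == false)
        return 1;
    return 0;
}

int taken_ne_true(bool boolvar)
{
    if (boolvar != true)
        return 1;
    return 0;
}
```

Since all three agree for both possible inputs, the choice between them is purely a matter of style; none of them forces an extra negate or load instruction.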
It's possible to make an if header or body that compiles to a single instruction, but it depends entirely on what you need to do and what condition is used. The flag for boolvar may already be available from the previous statement, so the if block on the next line can use it to jump directly, as you can see in Jerry Coffin's answer.
Moreover x86 has conditional moves, so if inside the if is a simple assignment then it may be done in 1 instruction. Below is an example and its output
int f(bool condition, int x, int y)
{
int ret = x;
if (!condition)
ret = y;
return ret;
}
f(bool, int, int):
test dil, dil ; if(!condition)
mov eax, edx ; ret = y
cmovne eax, esi ; if(condition) ret = x
ret
In some other cases you don't even need a conditional move or jump. For example
bool f(bool condition)
{
bool ret = false;
if (!condition)
ret = true;
return ret;
}
compiles to a single xor without any jump at all
f(bool):
mov eax, edi
xor eax, 1
ret
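The reason the xor is valid can be sketched in C itself: a bool only ever holds 0 or 1, so flipping bit 0 is the same as negating it. The names negate_branch and negate_xor are hypothetical; this only illustrates the equivalence and does not guarantee what any particular compiler emits.

```c
#include <stdbool.h>

/* Branching form, as written in the example above. */
bool negate_branch(bool condition)
{
    bool ret = false;
    if (!condition)
        ret = true;
    return ret;
}

/* Branch-free form: condition is 0 or 1, so XOR with 1 flips it. */
bool negate_xor(bool condition)
{
    return (bool)(condition ^ 1);
}
```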
The ARM architecture (v7 and below, in ARM state) can execute almost any instruction conditionally, so the if may translate to only one instruction
For example the following loop
while (i != j)
{
if (i > j)
{
i -= j;
}
else
{
j -= i;
}
}
can be translated to ARM assembly as
loop: CMP Ri, Rj ; set condition "NE" if (i != j),
; "GT" if (i > j),
; or "LT" if (i < j)
SUBGT Ri, Ri, Rj ; if "GT" (Greater Than), i = i-j;
SUBLT Rj, Rj, Ri ; if "LT" (Less Than), j = j-i;
BNE loop ; if "NE" (Not Equal), then loop
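For reference, the loop above is the subtraction form of Euclid's GCD algorithm. A minimal sketch of it as a complete C function follows; the name gcd_sub and the unsigned types are my own choices, and it assumes both inputs are positive (with exactly one input equal to 0 the loop never terminates).

```c
/* Subtraction-based GCD; assumes i > 0 and j > 0. */
unsigned gcd_sub(unsigned i, unsigned j)
{
    while (i != j) {
        if (i > j)
            i -= j;   /* corresponds to SUBGT */
        else
            j -= i;   /* corresponds to SUBLT */
    }
    return i;         /* i == j here: the common divisor */
}
```

For example, gcd_sub(48, 18) steps through (30, 18), (12, 18), (12, 6), (6, 6) and returns 6.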

Using GCC inline assembly with instructions that take immediate values

The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
Which means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruction requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
switch (code) {
case SVC_YIELD:
svc_yield();
break;
// Other cases follow
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
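One possible way to keep the dispatch manageable is a table of handlers indexed by the SVC code. The sketch below is only an illustration: SVC_SLEEP, the counting stand-in handlers, the name handle_svc_table and the int return value are all assumptions, not part of the kernel described here. Whether it is actually smaller or faster than the switch depends on the compiler and target.

```c
#include <stddef.h>
#include <stdint.h>

#define SVC_YIELD 0
#define SVC_SLEEP 1            /* hypothetical second code, for illustration */

static int yield_calls;        /* stand-in handlers just count their calls */
static int sleep_calls;

static void svc_yield(void) { yield_calls++; }
static void svc_sleep(void) { sleep_calls++; }

typedef void (*svc_handler_t)(void);

/* Handler table indexed by SVC code; unlisted slots would be NULL. */
static const svc_handler_t svc_table[] = {
    [SVC_YIELD] = svc_yield,
    [SVC_SLEEP] = svc_sleep,
};

/* Returns 0 if a handler ran, -1 for an unknown code. */
int handle_svc_table(uint8_t code)
{
    if (code >= sizeof svc_table / sizeof svc_table[0] || svc_table[code] == NULL)
        return -1;
    svc_table[code]();
    return 0;
}
```

Adding a new SVC then means adding one table entry instead of another case.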
What I've tried
SVC with a register argument
I initially considered
__attribute__((naked)) void svc(char code)
{
asm volatile ("svc r0");
}
but that, of course, does not work, as SVC requires an immediate argument, not a register.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
{
switch (code) {
case 0:
asm volatile("svc #0");
break;
case 1:
asm volatile("svc #1");
break;
/* 253 cases omitted */
case 255:
asm volatile("svc #255");
break;
}
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
asm volatile (
"orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
"push {r1, r0} \n\t" // Store the instruction to RAM (on the stack)
"mov r0, sp \n\t" // Copy the stack pointer to an ordinary register
"add r0, #1 \n\t" // Add 1 to the address to specify THUMB mode
"bx r0 \n\t" // Branch to newly created instruction
"pop {r1, r0} \n\t" // Restore the stack
"bx lr \n\t" // Return to caller
);
}
but this just doesn't feel right either. Also, it doesn't work - There's something I'm doing wrong here; perhaps my instruction isn't properly aligned or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I have to work on that last option. But still, it feels like I ought to be able to do something like:
__attribute__((naked)) void svc(char code)
{
asm volatile ("svc %0"
: /* No outputs */
: "i" (code) // Imaginary directive specifying an immediate argument
// as opposed to conventional "r"
);
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?
You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constraints are -- you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for full information.
There's also some good ARM-specific gcc docs here. A couple of pages down under the heading "Input and output operands" it provides a table of all the ARM constraints
What about using a macro:
#define SVC(i) asm volatile("svc #"#i)
As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note however that it won't work if you pass an enum value to it, only a #defined one.
Therefore, Chris' answer above is the best, as it uses an immediate value, which is what's required, for thumb instructions at least.
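The difference between the one-level and two-level macros can be checked on any host compiler by looking at the strings they build, with no asm involved. In this sketch SVC_EXIT_ENUM and the three variable names are hypothetical, added only to show the enum limitation mentioned above.

```c
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)

#define SVC_YIELD 0              /* preprocessor constant, as in the question */
enum { SVC_EXIT_ENUM = 1 };      /* hypothetical enum constant, for contrast */

/* One level: # is applied before the argument is macro-expanded. */
const char *one_level = "svc #" STRINGIFY0(SVC_YIELD);     /* "svc #SVC_YIELD" */

/* Two levels: the argument is expanded first, then stringified. */
const char *two_level = "svc #" STRINGIFY(SVC_YIELD);      /* "svc #0" */

/* Enum constants are invisible to the preprocessor, so the name survives. */
const char *enum_case = "svc #" STRINGIFY(SVC_EXIT_ENUM);  /* "svc #SVC_EXIT_ENUM" */
```

Only the second string is something the assembler would accept as an SVC with an immediate operand.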
My solution ("Generating the instruction encoding on the fly"):
#include <stdint.h>
#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)
void svc_call(uint32_t svc_num)
{
uint16_t instrs[2];
instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);
instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);
// PC = instrs (or 1 -> thumb mode)
((void(*)(void))((uint32_t)instrs | 1))();
}
It works, and it's much better than the switch-case variant, which takes ~2 KB of ROM for 256 SVCs. This function does not have to be placed in a RAM section; Flash is fine.
You can use it if svc_num should be a runtime variable.
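The encoding step on its own can be sketched and checked on a host machine. In the 16-bit Thumb encoding of SVC the upper byte is 0xDF and the lower byte is the immediate. The helper name encode_thumb_svc is hypothetical, and unlike the code above it masks the number to 8 bits so that an out-of-range value cannot corrupt the opcode byte.

```c
#include <stdint.h>

#define INSTR_CODE_SVC (0xDF00u)

/* Hypothetical helper: build the Thumb encoding of "svc #svc_num".
 * Assumption: values above 255 are truncated to their low 8 bits. */
uint16_t encode_thumb_svc(uint32_t svc_num)
{
    return (uint16_t)(INSTR_CODE_SVC | (svc_num & 0xFFu));
}
```

For instance, encode_thumb_svc(0x2A) yields 0xDF2A.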
As discussed in this question, the operand of SVC is fixed; that is, it must be known to the preprocessor, and it differs from the immediate operands of data-processing instructions.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.
I wrote one handler recently for my own toy OS on Cortex-M. Works if tasks use PSP pointer.
Idea:
Get interrupted process's stack pointer, get process's stacked PC, it will have the instruction address of instruction after SVC, look up the immediate value in the instruction. It's not as hard as it sounds.
uint8_t __attribute__((naked)) get_svc_code(void){
__asm volatile("MRS R0, PSP"); //Get Process Stack Pointer (We're in SVC ISR, so currently MSP in use)
__asm volatile("ADD R0, #24"); //Pointer to stacked process's PC is in R0
__asm volatile("LDR R1, [R0]"); //Instruction Address after SVC is in R1
__asm volatile("SUB R1, R1, #2"); //Subtract 2 bytes from the stacked return address. Now R1 contains address of SVC instruction
__asm volatile("LDRB R0, [R1]"); //Load lower byte of 16-bit instruction into R0. It's immediate value.
__asm volatile("BX LR"); //Value is in R0. A naked function has no epilogue, so return explicitly
}
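The same lookup can be sketched in plain C, assuming the handler has already obtained a pointer to the stacked exception frame (r0, r1, r2, r3, r12, lr, pc, xPSR, in that order). The name svc_code_from_frame is hypothetical, and uintptr_t is used so the sketch also runs on a host; on the Cortex-M itself the frame words are 32 bits wide.

```c
#include <stdint.h>

/* Index of the stacked PC in the frame: r0, r1, r2, r3, r12, lr, pc, xPSR. */
enum { STACKED_PC_INDEX = 6 };

/* Hypothetical helper: frame points at the stacked registers of the
 * interrupted thread. The stacked PC is the address of the instruction
 * after the SVC, so the SVC's immediate byte sits 2 bytes before it
 * (little-endian Thumb encoding: immediate byte first, then 0xDF). */
uint8_t svc_code_from_frame(const uintptr_t *frame)
{
    const uint8_t *return_addr = (const uint8_t *)frame[STACKED_PC_INDEX];
    return return_addr[-2];
}
```

On a host this can be exercised with a fake frame whose PC slot points just past a byte pair such as 0x2A, 0xDF, in which case the function returns 0x2A.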

Resources