Using GCC inline assembly with instructions that take immediate values - c

The problem
I'm working on a custom OS for an ARM Cortex-M3 processor. To interact with my kernel, user threads have to generate a SuperVisor Call (SVC) instruction (previously known as SWI, for SoftWare Interrupt). The definition of this instruction in the ARM ARM is:
Which means that the instruction requires an immediate argument, not a register value.
This is making it difficult for me to architect my interface in a readable fashion. It requires code like:
asm volatile( "svc #0");
when I'd much prefer something like
svc(SVC_YIELD);
However, I'm at a loss to construct this function, because the SVC instruciton requires an immediate argument and I can't provide that when the value is passed in through a register.
The kernel:
For background, the svc instruction is decoded in the kernel as follows
#define SVC_YIELD 0
// Other SVC codes
// Called by the SVC interrupt handler (not shown)
void handleSVC(char code)
{
switch (code) {
case SVC_YIELD:
svc_yield();
break;
// Other cases follow
This case statement is getting rapidly out of hand, but I see no way around this problem. Any suggestions are welcome.
What I've tried
SVC with a register argument
I initially considered
__attribute__((naked)) svc(char code)
{
asm volatile ("scv r0");
}
but that, of course, does not work as SVC requires a register argument.
Brute force
The brute-force attempt to solve the problem looks like:
void svc(char code)
switch (code) {
case 0:
asm volatile("svc #0");
break;
case 1:
asm volatile("svc #1");
break;
/* 253 cases omitted */
case 255:
asm volatile("svc #255");
break;
}
}
but that has a nasty code smell. Surely this can be done better.
Generating the instruction encoding on the fly
A final attempt was to generate the instruction in RAM (the rest of the code is running from read-only Flash) and then run it:
void svc(char code)
{
asm volatile (
"orr r0, 0xDF00 \n\t" // Bitwise-OR the code with the SVC encoding
"push {r1, r0} \n\t" // Store the instruction to RAM (on the stack)
"mov r0, sp \n\t" // Copy the stack pointer to an ordinary register
"add r0, #1 \n\t" // Add 1 to the address to specify THUMB mode
"bx r0 \n\t" // Branch to newly created instruction
"pop {r1, r0} \n\t" // Restore the stack
"bx lr \n\t" // Return to caller
);
}
but this just doesn't feel right either. Also, it doesn't work - There's something I'm doing wrong here; perhaps my instruction isn't properly aligned or I haven't set up the processor to allow running code from RAM at this location.
What should I do?
I have to work on that last option. But still, it feels like I ought to be able to do something like:
__attribute__((naked)) svc(char code)
{
asm volatile ("scv %1"
: /* No outputs */
: "i" (code) // Imaginary directive specifying an immediate argument
// as opposed to conventional "r"
);
}
but I'm not finding any such option in the documentation and I'm at a loss to explain how such a feature would be implemented, so it probably doesn't exist. How should I do this?

You want to use a constraint to force the operand to be allocated as an 8-bit immediate. For ARM, that is constraint I. So you want
#define SVC(code) asm volatile ("svc %0" : : "I" (code) )
See the GCC documentation for a summary of what all the constaints are -- you need to look at the processor-specific notes to see the constraints for specific platforms. In some cases, you may need to look at the .md (machine description) file for the architecture in the gcc source for full information.
There's also some good ARM-specific gcc docs here. A couple of pages down under the heading "Input and output operands" it provides a table of all the ARM constraints

What about using a macro:
#define SVC(i) asm volatile("svc #"#i)

As noted by Chris Dodd in the comments on the macro, it doesn't quite work, but this does:
#define STRINGIFY0(v) #v
#define STRINGIFY(v) STRINGIFY0(v)
#define SVC(i) asm volatile("svc #" STRINGIFY(i))
Note however that it won't work if you pass an enum value to it, only a #defined one.
Therefore, Chris' answer above is the best, as it uses an immediate value, which is what's required, for thumb instructions at least.

My solution ("Generating the instruction encoding on the fly"):
#define INSTR_CODE_SVC (0xDF00)
#define INSTR_CODE_BX_LR (0x4770)
void svc_call(uint32_t svc_num)
{
uint16_t instrs[2];
instrs[0] = (uint16_t)(INSTR_CODE_SVC | svc_num);
instrs[1] = (uint16_t)(INSTR_CODE_BX_LR);
// PC = instrs (or 1 -> thumb mode)
((void(*)(void))((uint32_t)instrs | 1))();
}
It works and its much better than switch-case variant, which takes ~2kb ROM for 256 svc's. This func does not have to be placed in RAM section, FLASH is ok.
You can use it if svc_num should be a runtime variable.

As discussed in this question, the operand of SVC is fixed, that is it should be known to the preprocessor, and it is different from immediate Data-processing operands.
The gcc manual reads
'I'- Integer that is valid as an immediate operand in a data processing instruction. That is, an integer in the range 0 to 255 rotated by a multiple of 2.
Therefore the answers here that use a macro are preferred, and the answer of Chris Dodd is not guaranteed to work, depending on the gcc version and optimization level. See the discussion of the other question.

I wrote one handler recently for my own toy OS on Cortex-M. Works if tasks use PSP pointer.
Idea:
Get interrupted process's stack pointer, get process's stacked PC, it will have the instruction address of instruction after SVC, look up the immediate value in the instruction. It's not as hard as it sounds.
uint8_t __attribute__((naked)) get_svc_code(void){
__asm volatile("MSR R0, PSP"); //Get Process Stack Pointer (We're in SVC ISR, so currently MSP in use)
__asm volatile("ADD R0, #24"); //Pointer to stacked process's PC is in R0
__asm volatile("LDR R1, [R0]"); //Instruction Address after SVC is in R1
__asm volatile("SUB R1, R1, #2"); //Subtract 2 bytes from the address of the current instruction. Now R1 contains address of SVC instruction
__asm volatile("LDRB R0, [R1]"); //Load lower byte of 16-bit instruction into R0. It's immediate value.
//Value is in R0. Function can return
}

Related

Volatile variable not updated despite unoptimized assembly

I'm working on a dual-core Cortex-R52 ARM chip, with an instance of FreeRTOS running in each core (AMP), and using ICCARM (IAR) as my compiler.
I need to ensure that CPU1 initialize some tasks, in order to pass their handler to CPU0 through the shared memory, but both cores are executed at the same time, which creates a problem in the scenario where CPU0 gets to using the supposedly passed handler, that wasn't created yet by CPU1.
A solution I tried, was creating a volatile variable pdSTART at a dedicated address space, which keeps CPU0 looping as long as its equal to 0:
#pragma location = 0x100F900C
__no_init volatile uint8_t pdSTART;
while (pdSTART == 0)
{
vTaskDelay(10 / portTICK_PERIOD_MS);
}
As expected the generated assembly was as follows:
vTaskDelay(10 / portTICK_PERIOD_MS);
0xc3a: 0x200a MOVS R0, #10 ; 0xa
0xc3c: 0xf000 0xf93c BL vTaskDelay ; 0xeb8
while (pdSTART == 0)
0xc40: 0x7b28 LDRB R0, [R5, #0xc]
0xc42: 0x2800 CMP R0, #0
0xc44: 0xd0f9 BEQ.N 0xc3a
With register R5 containing the address 0x100F9000.
Using the debugger I made sure CPU0 reaches the while condition first and gets in the loop, I then made CPU1 change the value of pdSTART, which I confirmed on the memory map
pdSTART:
0x100f'900c: 0x0000'0001 DC32 VECTOR_RBLOCK$$Base
And yet the condition on CPU0 remains false and pdSTART is never updated, both the memory map and "Watch" window of the debugger show the variable updated.
I tried explicitly writing a read from the address of pdSTART:
void func(void)
{
asm volatile ("" : : "r" (*(uint8_t *)0x100F900C));
}
But the generated assembly was the same as the while condition.
Is the old value of pdSTART saved into some kind of stack or cache? is there a way to forcefully update it?
Thank you.

Why doesn't my ARM LDREX/STREX C function work?

I wrote a claim_lock function in C, according to the "Barrier Litmus Tests and Cookbook" document. I examined the generated code, and it all looks good, but it didn't work.
// This code conforms to the section 7.2 of PRD03-GENC-007826:
// "Acquiring and Releasing a Lock"
static inline void claim_lock( uint32_t volatile *lock )
{
uint32_t failed = 1;
uint32_t value;
while (failed) {
asm volatile ( "ldrex %[value], [%[lock]]"
: [value] "=&r" (value)
: [lock] "r" (lock) );
if (value == 0) {
// The failed and lock registers are not allowed to be the same, so
// pretend to gcc that the lock pointer may be written as well as read.
asm volatile ( "strex %[failed], %[value], [%[lock]]"
: [failed] "=&r" (failed)
, [lock] "+r" (lock)
: [value] "r" (1) );
}
else {
asm ( "clrex" );
}
}
asm ( "dmb sy" );
}
Generated code (gcc):
1000: e3a03001 mov r3, #1
1004: e1902f9f ldrex r2, [r0]
1008: e3520000 cmp r2, #0
100c: 1a000004 bne 1024 <claim_lock+0x24>
1010: e1802f93 strex r2, r3, [r0]
1014: e3520000 cmp r2, #0
1018: 1afffff9 bne 1004 <claim_lock+0x4>
101c: f57ff05f dmb sy
1020: e12fff1e bx lr
1024: f57ff01f clrex
1028: eafffff5 b 1004 <claim_lock+0x4>
Corresponding release function:
static inline void release_lock( uint32_t volatile *lock )
{
// Ensure that any changes made while holding the lock are
// visible before the lock is seen to have been released
asm ( "dmb sy" );
*lock = 0;
}
It worked in QEMU, but either hung, or allowed all cores to "claim" the so-called "lock" on real hardware (Raspberry Pi 3 Cortex-A53).
this is what i found in Context switch section of ARMv7-M Architecture
Reference Manual
Blockquote
It is necessary to ensure that the local monitor is in the Open Access state after a context switch. In
ARMv7-M, the local monitor is changed to Open Access automatically as part of an exception entry or exit
sequence. The local monitor can also be forced to the Open Access state by a CLREX instruction.
Note
Context switching is not an application level operation. However, this information is included here to
complete the description of the exclusive operations.
A context switch might cause a subsequent Store-Exclusive to fail, requiring a load … store sequence to be
replayed. To minimize the possibility of this happening, ARM recommends that the Store-Exclusive
instruction is kept as close as possible to the associated Load-Exclusive instruction, see Load-Exclusive and
Store-Exclusive usage restrictions.
Blockquote
The LDREX instruction will hang the core (unless my test failed to report an exception) if:
The MMU is not enabled
The virtual memory area containing the lock is not cached
The cores will appear to ignore each other's claims if:
Symmetric Multi-processing has not been enabled
The SMP enable mechanism seems to vary from device to device; check the TRM for the partular core, it's outside the scope of the ARM ARM.
For the Cortex-A53, the bit to set is SMPEN, bit 6 of The CPU Extended Control Register, CPUECTLR.
Earlier devices have bit 5 of the Auxiliary Control Register, for example (ARM11 MPcore), where there's also the SCU to consider. I don't have such a device, but it's that documentation where I first noticed an SMP/nAMP bit.

Mixing C and assembly and its impact on registers

Consider the following C and (ARM) assembly snippet, which is to be compiled with GCC:
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 q12, #0.0\n\t"
: [data_addr] "+r" (data_addr)
: : "q0", "q12");
for(int n=0; n<10; ++n){
__asm__ __volatile__ (
"vadd.f32 q12, q12, q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr),
:: "q0", "q12");
}
In this example, I am initialising some SIMD registers outside the loop and then having C handle the loop logic, with those initialised registers being used inside the loop.
This works in some test code, but I'm concerned of the risk of the compiler clobbering the registers between snippets. Is there any way of ensuring this doesn't happen? Can I infer any assurances about the type of registers that are going to be used in a snippet (in this case, that no SIMD registers will be clobbered)?
In general, there's not a way to do this in gcc; clobbers only guarantee that registers will be preserved around the asm call. If you need to ensure that the registers are saved between two asm sections, you will need to store them to memory in the first, and reload in the second.
Edit: After much fiddling around I've come to the conclusion this is much harder to solve in general using the strategy described below than I initially thought.
The problem is that, particularly when all the registers are used, there is nothing to stop the first register stash from overwriting another. Whether there is some trick to play with using direct memory writes that can be optimised away I don't know, but initial tests would suggest the compiler might still choose to clobber not-yet-stashed registers
For the time being and until I have more information, I'm unmarking this answer as correct and this answer should treated as probably wrong in the general case. My conclusion is this that such local protection of registers needs better support in the compiler to be useful
This absolutely is possible to do reliably. Drawing on the comments by #PeterCordes as well as the docs and a couple of useful bug reports (gcc 41538 and 37188) I came up with the following solution.
The point that makes it valid is the use of temporary variables to make sure the registers are maintained (logically, if the loop clobbers them, then they will be reloaded). In practice, the temporary variables are optimised away which is clear from inspection of the resultant asm.
// d0 and d1 map to the first and second values of q0, so we use
// q0 to reduce the number of tmp variables we pass around (instead
// of using one for each of d0 and d1).
register float32x4_t data __asm__ ("q0");
register float32x4_t output __asm__ ("q12");
float32x4_t tmp_data;
float32x4_t tmp_output;
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 %q[output], #0.0\n\t"
: [data_addr] "+r" (data_addr),
[output] "=&w" (output),
"=&w" (data) // we still need to constrain data (q0) as written to.
::);
// Stash the register values
tmp_data = data;
tmp_output = output;
for(int n=0; n<10; ++n){
// Make sure the registers are loaded correctly
output = tmp_output;
data = tmp_data;
__asm__ __volatile__ (
"vadd.f32 %[output], %[output], q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr),
[output] "+w" (output),
"+w" (data) // again, data (q0) was written to in the vldmia op.
::);
// Remember to stash the registers again before continuing
tmp_data = data;
tmp_output = output;
}
It's necessary to instruct the compiler that q0 is written to in the last line of each asm output constraint block, so it doesn't think it can reorder the stashing and reloading of the data register resulting in the asm block getting invalid values.

Lookup table vs switch in C embedded software

In another thread, I was told that a switch may be better than a lookup table in terms of speed and compactness.
So I'd like to understand the differences between this:
Lookup table
static void func1(){}
static void func2(){}
typedef enum
{
FUNC1,
FUNC2,
FUNC_COUNT
} state_e;
typedef void (*func_t)(void);
const func_t lookUpTable[FUNC_COUNT] =
{
[FUNC1] = &func1,
[FUNC2] = &func2
};
void fsm(state_e state)
{
if (state < FUNC_COUNT)
lookUpTable[state]();
else
;// Error handling
}
and this:
Switch
static void func1(){}
static void func2(){}
void fsm(int state)
{
switch(state)
{
case FUNC1: func1(); break;
case FUNC2: func2(); break;
default: ;// Error handling
}
}
I thought that a lookup table was faster since compilers try to transform switch statements into jump tables when possible.
Since this may be wrong, I'd like to know why!
Thanks for your help!
As I was the original author of the comment, I have to add a very important issue you did not mention in your question. That is, the original was about an embedded system. Presuming this is a typical bare-metal system with integrated Flash, there are very important differences from a PC on which I will concentrate.
Such embedded systems typically have the following constraints.
no CPU cache.
Flash requires waitstates for higher (i.e. >ca. 32MHz) CPU clocks. The actual ratio depends on the die design, low power/high speed process, operating voltage, etc.
To hide waitstates, Flash has wider read-lines than the CPU-bus.
This only works well for linear code with instruction prefetch.
Data accesses disturb instruction prefetch or are stalled until it finished.
Flash might have an internal very small instruction cache.
If any at all, there is an even smaller data-cache.
The small caches result in more frequent trashing (replacing a previous entry before that has been used another time).
For e.g. the STM32F4xx a read takes 6 clocks at 150MHz/3.3V for 128 bits (4 words). So if a data-access is required, chances are good it adds more than 12 clocks delay for all data to be fetched (there are additional cycles involved).
Presuming compact state-codes, for the actual problem, this has the following effects on this architecture (Cortex-M4):
Lookup-table: Reading the function address is a data-access. With all implications mentioned above.
A switch otoh uses a special "table-lookup" instruction which uses code-space data right behind the instruction. So the first entries are possibly already prefetched. Other entries don't break the prefetch. Also the access is a code-acces, thus the data goes into the Flash's instruction cache.
Also note that the switch does not need functions, thus the compiler can fully optimise the code. This is not possible for a lookup table. At least code for function entry/exit is not required.
Due to the aforementioned and other factors, an estimate is hard to tell. It heavily depends on your platform and the code structure. But assuming the system given above, the switch is very likely faster (and clearer, btw.).
First, on some processors, indirect calls (e.g. through a pointer) - like those in your Lookup Table example - are costly (pipeline breakage, TLB, cache effects). It might also be true for indirect jumps...
Then, a good optimizing compiler might inline the call to func1() in your Switch example; then you won't run any prologue or epilogue for an inlined functions.
You need to benchmark to be sure, since a lot of other factors matter on the performance. See also this (and the reference there).
Using a LUT of function pointers forces the compiler to use that strategy. It could in theory compile the switch version to essentially the same code as the LUT version (now that you've added out-of-bounds checks to both). In practice, that's not what gcc or clang choose to do, so it's worth looking at the asm output to see what happened.
(update: gcc -fpie (on by default on most modern Linux distros) likes to make tables of relative offsets, instead of absolute function pointers, so the rodata is position-independent, too. GCC Jump Table initialization code generating movsxd and add?. This could be a missed-optimization, see my answer there for links to gcc bug reports. Manually creating an array of function pointers could work around that.)
I put the code on the Godbolt compiler explorer with both functions in one compilation unit (with gcc and clang output), to see how it actually compiled. I expanded the functions a bit so it wasn't just two cases.
void fsm_switch(int state) {
switch(state) {
case FUNC0: func0(); break;
case FUNC1: func1(); break;
case FUNC2: func2(); break;
case FUNC3: func3(); break;
default: ;// Error handling
}
//prevent_tailcall();
}
void fsm_lut(state_e state) {
if (likely(state < FUNC_COUNT)) // without likely(), gcc puts the LUT on the taken side of this branch
lookUpTable[state]();
else
;// Error handling
//prevent_tailcall();
}
See also
How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?
x86
On x86, clang makes its own LUT for the switch, but the entries are pointers to within the function, not the final function pointers. So for clang-3.7, the switch happens to compile to code that is strictly worse than the manually-implemented LUT. Either way, x86 CPUs tend to have branch prediction that can handle indirect calls / jumps, at least if they're easy to predict.
GCC uses a sequence of conditional branches (but unfortunately doesn't tail-call directly with conditional branches, which AFAICT is safe on x86. It checks 1, <1, 2, 3, in that order, with mostly not-taken branches until it finds a match.
They make essentially identical code for the LUT: bounds check, zero the upper 32-bit of the arg register with a mov, and then a memory-indirect jump with an indexed addressing mode.
ARM:
gcc 4.8.2 with -mcpu=cortex-m4 -O2 makes interesting code.
As Olaf said, it makes an inline table of 1B entries. It doesn't jump directly to the target function, but instead to a normal jump instruction (like b func3). This is a normal unconditional jump, since it's a tail-call.
Each table destination entry needs significantly more code (Godbolt) if fsm_switch does anything after the call (like in this case a non-inline function call, if void prevent_tailcall(void); is declared but not defined), or if this is inlined into a larger function.
## With void prevent_tailcall(void){} defined so it can inline:
## Unlike in the godbolt link, this is doing tailcalls.
fsm_switch:
cmp r0, #3 # state,
bhi .L5 #
tbb [pc, r0] # state
## There's no section .rodata directive here: the table is in-line with the code, so there's no need for base pointer to be loaded into a reg. And apparently it's even loaded from I-cache, not D-cache
.byte (.L7-.L8)/2
.byte (.L9-.L8)/2
.byte (.L10-.L8)/2
.byte (.L11-.L8)/2
.L11:
b func3 # optimized tail-call
.L10:
b func2
.L9:
b func1
.L7:
b func0
.L5:
bx lr # This is ARM's equivalent of an x86 ret insn
IDK if there's much difference between how well branch prediction works for tbb vs. a full-on indirect jump or call (blx), on a lightweight ARM core. A data access to load the table might be more significant than the two-step jump to a branch instruction you get with a switch.
I've read that indirect branches are poorly predicted on ARM. I'd hope it's not bad if the indirect branch has the same target every time. But if not, I'd assume most ARM cores won't find even short patterns the way big x86 cores will.
Instruction fetch/decode takes longer on x86, so it's more important to avoid bubbles in the instruction stream. This is one reason why x86 CPUs have such good branch prediction. Modern branch predictors even do a good job with patterns for indirect branches, based on history of that branch and/or other branches leading up to it.
The LUT function has to spend a couple instructions loading the base address of the LUT into a register, but otherwise is pretty much like x86:
fsm_lut:
cmp r0, #3 # state,
bhi .L13 #,
movw r3, #:lower16:.LANCHOR0 # tmp112,
movt r3, #:upper16:.LANCHOR0 # tmp112,
ldr r3, [r3, r0, lsl #2] # tmp113, lookUpTable
bx r3 # indirect register sibling call # tmp113
.L13:
bx lr #
# in the .rodata section
lookUpTable:
.word func0
.word func1
.word func2
.word func3
See Mike of SST's answer for a similar analysis on a Microchip dsPIC.
msc's answer and the comments give you good hints as to why performance may not be what you expect. Benchmarking is the rule, but results will vary from one architecture to another, and may change with other versions of the compiler and of course its configuration and options selected.
Note however that your 2 pieces of code do not perform the same validation on state:
The switch will gracefully do nothing is state is not one of the defined values,
The jump table version will invoke undefined behavior for all but the 2 values FUNC1 and FUNC2.
There is no generic way to initialize the jump table with dummy function pointers without making assumptions on FUNC_COUNT. Do get the same behavior, the jump table version should look like this:
void fsm(int state) {
if (state >= 0 && state < FUNC_COUNT && lookUpTable[state] != NULL)
lookUpTable[state]();
}
Try benchmarking this and inspect the assembly code. Here is a handy online compiler for this: http://gcc.godbolt.org/#
On the Microchip dsPIC family of devices a look-up table is stored as a set of instruction addresses in the Flash itself. Performing the look-up involves reading the address from the Flash then calling the routine. Making the call adds another handful of cycles to push the instruction pointer and other bits and bobs (e.g. setting the stack frame) of housekeeping.
For example, on the dsPIC33E512MU810, using XC16 (v1.24) the look-up code:
lookUpTable[state]();
Compiles to (from the disassembly window in MPLAB-X):
! lookUpTable[state]();
0x2D20: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D22: ADD W4, W4, W5 ; 1 cycle (addresses are 16 bit aligned)
0x2D24: MOV #0xA238, W4 ; 1 cycle (get base address of look-up table)
0x2D26: ADD W5, W4, W4 ; 1 cycle (get address of entry in table)
0x2D28: MOV [W4], W4 ; 1 cycle (get address of the function)
0x2D2A: CALL W4 ; 2 cycles (push PC+2 set PC=W4)
... and each (empty, do-nothing) function compiles to:
!static void func1()
!{}
0x2D0A: LNK #0x0 ; 1 cycle (set up stack frame)
! Function body goes here
0x2D0C: ULNK ; 1 cycle (un-link frame pointer)
0x2D0E: RETURN ; 3 cycles
This is a total of 11 instruction cycles of overhead for any of the cases, and they all take the same. (Note: If either the table or the functions it contains are not in the same 32K program word Flash page, there will be an even greater overhead due to having to get the Address Generation Unit to read from the correct page, or to set up the PC to make a long call.)
On the other hand, providing that the whole switch statement fits within a certain size, the compiler will generate code that does a test and relative branch as two instructions per case taking three (or possibly four) cycles per case up to the one that's true.
For example, the switch statement:
switch(state)
{
case FUNC1: state++; break;
case FUNC2: state--; break;
default: break;
}
Compiles to:
! switch(state)
0x2D2C: MOV [W14], W4 ; get state from stack-frame (not counted)
0x2D2E: SUB W4, #0x0, [W15] ; 1 cycle (compare with first case)
0x2D30: BRA Z, 0x2D38 ; 1 cycle (if branch not taken, or 2 if it is)
0x2D32: SUB W4, #0x1, [W15] ; 1 cycle (compare with second case)
0x2D34: BRA Z, 0x2D3C ; 1 cycle (if branch not taken, or 2 if it is)
! {
! case FUNC1: state++; break;
0x2D38: INC [W14], [W14] ; To stop the switch being optimised out
0x2D3A: BRA 0x2D40 ; 2 cycles (go to end of switch)
! case FUNC2: state--; break;
0x2D3C: DEC [W14], [W14] ; To stop the switch being optimised out
0x2D3E: NOP ; compiler did a fall-through (for some reason)
! default: break;
0x2D36: BRA 0x2D40 ; 2 cycles (go to end of switch)
! }
This is an overhead of 5 cycles if the first case is taken, 7 if the second case is taken, etc., meaning they break even on the fourth case.
This means that knowing your data at design time will have a significant influence on the long-term speed. If you have a significant number (more than about 4 cases) and they all occur with similar frequency then a look-up table will be quicker in the long run. If the frequency of the cases is significantly different (e.g. case 1 is more likely than case 2, which is more likely than case 3, etc.) then, if you order the switch with the most likely case first, then the switch will be faster in the long run. For the edge case when you only have a few cases the switch will (probably) be faster anyway for most executions and is more readable and less error prone.
If there are only a few cases in the switch, or some cases will occur more often than others, then doing the test and branch of the switch will probably take fewer cycles than using a look-up table. On the other hand, if you have more than a handful of cases of that occur with similar frequency then the look-up will probably end up being faster on average.
Tip: Go with the switch unless you know the look-up will definitely be faster and the time it takes to run is important.
Edit: My switch example is a little unfair, as I've ignored the original question and in-lined the 'body' of the cases to highlight the real advantage of using a switch over a look-up. If the switch has to do the call as well then it only has the advantage for the first case!
To have even more compiler outputs, here what is produced by the TI C28x compiler using #PeterCordes sample code:
_fsm_switch:
CMPB AL,#0 ; [CPU_] |62|
BF $C$L3,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#1 ; [CPU_] |62|
BF $C$L2,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#2 ; [CPU_] |62|
BF $C$L1,EQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
CMPB AL,#3 ; [CPU_] |62|
BF $C$L4,NEQ ; [CPU_] |62|
; branchcc occurs ; [] |62|
LCR #_func3 ; [CPU_] |66|
; call occurs [#_func3] ; [] |66|
B $C$L4,UNC ; [CPU_] |66|
; branch occurs ; [] |66|
$C$L1:
LCR #_func2 ; [CPU_] |65|
; call occurs [#_func2] ; [] |65|
B $C$L4,UNC ; [CPU_] |65|
; branch occurs ; [] |65|
$C$L2:
LCR #_func1 ; [CPU_] |64|
; call occurs [#_func1] ; [] |64|
B $C$L4,UNC ; [CPU_] |64|
; branch occurs ; [] |64|
$C$L3:
LCR #_func0 ; [CPU_] |63|
; call occurs [#_func0] ; [] |63|
$C$L4:
LCR #_prevent_tailcall ; [CPU_] |69|
; call occurs [#_prevent_tailcall] ; [] |69|
LRETR ; [CPU_]
; return occurs ; []
_fsm_lut:
;* AL assigned to _state
CMPB AL,#4 ; [CPU_] |84|
BF $C$L5,HIS ; [CPU_] |84|
; branchcc occurs ; [] |84|
CLRC SXM ; [CPU_]
MOVL XAR4,#_lookUpTable ; [CPU_U] |85|
MOV ACC,AL << 1 ; [CPU_] |85|
ADDL XAR4,ACC ; [CPU_] |85|
MOVL XAR7,*+XAR4[0] ; [CPU_] |85|
LCR *XAR7 ; [CPU_] |85|
; call occurs [XAR7] ; [] |85|
$C$L5:
LCR #_prevent_tailcall ; [CPU_] |88|
; call occurs [#_prevent_tailcall] ; [] |88|
LRETR ; [CPU_]
; return occurs ; []
I also used -O2 optimizations.
We can see that the switch is not converted into a jump table even if the compiler has the ability.

Using #defined values before RAM has been initialised

I am writing the boot-up code for an ARM CPU. There is no internal RAM, but there is 1GB of DDRAM connected to the CPU, which is not directly accessible before initialisation. The code is stored in flash, initialises RAM, then copies itself and the data segment to RAM and continue execution there. My program is:
#define REG_BASE_BOOTUP 0xD0000000
#define INTER_REGS_BASE REG_BASE_BOOTUP
#define SDRAM_FTDLL_REG_DEFAULT_LEFT 0x887000
#define DRAM_BASE 0x0
#define SDRAM_FTDLL_CONFIG_LEFT_REG (DRAM_BASE+ 0x1484)
... //a lot of registers
void sdram_init() __attribute__((section(".text_sdram_init")));
void ram_init()
{
static volatile unsigned int* const sdram_ftdll_config_left_reg = (unsigned int*)(INTER_REGS_BASE + SDRAM_FTDLL_CONFIG_LEFT_REG);
... //a lot of registers assignments
*sdram_ftdll_config_left_reg = SDRAM_FTDLL_REG_DEFAULT_LEFT;
}
At the moment my program is not working correctly because the register values end up being linked to RAM, and at the moment the program tries to access them only the flash is usable.
How could I change my linker script or my program so that those values have their address in flash? Is there a way I can have those values in the text segment?
And actually are those defined values global or static data when they are declared at file scope?
Edit:
The object file is linked with the following linker script:
MEMORY
{
RAM (rw) : ORIGIN = 0x00001000, LENGTH = 12M-4K
ROM (rx) : ORIGIN = 0x007f1000, LENGTH = 60K
VECTOR (rx) : ORIGIN = 0x007f0000, LENGTH = 4K
}
SECTIONS
{
.startup :
{
KEEP((.text.vectors))
sdram_init.o(.sdram_init)
} > VECTOR
...
}
Disassembly from the register assignment:
*sdram_ftdll_config_left_reg = SDRAM_FTDLL_REG_DEFAULT_LEFT;
7f0068: e59f3204 ldr r3, [pc, #516] ; 7f0274 <sdram_init+0x254>
7f006c: e5932000 ldr r2, [r3]
7f0070: e59f3200 ldr r3, [pc, #512] ; 7f0278 <sdram_init+0x258>
7f0074: e5823000 str r3, [r2]
...
7f0274: 007f2304 .word 0x007f2304
7f0278: 00887000 .word 0x00887000
To answer your question directly -- #defined values are not stored in the program anywhere (besides possibly in debug sections). Macros are expanded at compile time as if you'd typed them out in the function, something like:
*((unsigned int *) 0xd0010000) = 0x800f800f;
The values do end up in the text segment, as part of your compiled code.
What's much more likely here is that there's something else you're doing wrong. Off the top of my head, my first guess would be that your stack isn't initialized properly, or is located in a memory region that isn't available yet.
There are a few options to solve this problem.
Use PC relative data access.
Use a custom linker script.
Use assembler.
Use PC relative data access
The trouble you have with this method is you must know details of how the compiler will generate code. #define register1 (volatile unsigned int *)0xd0010000UL is that this is being stored as a static variable which is loaded from the linked SDRAM address.
7f0068: ldr r3, [pc, #516] ; 7f0274 <sdram_init+0x254>
7f006c: ldr r2, [r3] ; !! This is a problem !!
7f0070: ldr r3, [pc, #512] ; 7f0278 <sdram_init+0x258>
7f0074: str r3, [r2]
...
7f0274: .word 0x007f2304 ; !! This memory doesn't exist.
7f0278: .word 0x00887000
You must do this,
void ram_init()
{
/* NO 'static', you can not do that. */
/* static */ volatile unsigned int* const sdram_reg =
(unsigned int*)(INTER_REGS_BASE + SDRAM_FTDLL_CONFIG_LEFT_REG);
*sdram_ftdll_config_left_reg = SDRAM_FTDLL_REG_DEFAULT_LEFT;
}
Or you may prefer to implement this in assembler as it is probably pretty obtuse as to what you can and can't do here. The main effect of the above C code is that every thing is calculated or PC relative. If you opt not to use a linker script, this must be the case. As Duskwuff points out, you also can have stack issues. If you have no ETB memory, etc, that you can use as a temporary stack then it probably best to code this in assembler.
Linker script
See gnu linker map... and many other question on using a linker script in this case. If you want specifics, you need to give actual addresses use by the processor. With this option you can annotate your function to specify which section it will live in. For instance,
void ram_init() __attribute__((section("FLASH")));
In this case, you would use the Gnu Linkers MEMORY statement and AT statements to put this code at the flash address where you desire it to run from.
Use assembler
Assembler gives you full control over memory use. You can garentee that no stack is used, that no non-PC relative code is generated and it will probably be faster to boot. Here is some table driven ARM assembler I have used for the case you describe, initializing an SDRAM controller.
/* Macro for table of register writes. */
.macro DCDGEN,type,addr,data
.long \type
.long \addr
.long \data
.endm
.set FTDLL_CONFIG_LEFT, 0xD0001484
sdram_init:
DCDGEN 4, FTDLL_CONFIG_LEFT, 0x887000
1:
init_sdram_bank:
adr r0,sdram_init
adr r1,1b
1:
/* Delay. */
mov r5,#0x100
2: subs r5,r5,#1
bne 2b
ldmia r0!, {r2,r3,r4} /* Load DCD entry. */
cmp r2,#1 /* byte? */
streqb r4,[r3] /* Store byte... */
strne r4,[r3] /* Store word. */
cmp r0,r1 /* table done? */
blo 1b
bx lr
/* Dump literal pool. */
.ltorg
Assembler has many benefits. You can also clear the bss section and setup the stack with simple routines. There are many on the Internet and I think you can probably code one yourself. The gnu ld script is also beneficial with assembler as you can ensure that sections like bss are aligned and a multiple of 4,8,etc. so that the clearing routine doesn't need special cases. Also, you will have to copy the code from flash to SDRAM after it is initialized. This is a fairly expensive/long running task and you can speed it up with some short assembler.

Resources