Initialising global variables in C in Harvard CPU - c

I built a 32-bit RISC-V CPU with a Harvard architecture, and I want to write programs for it in C. I have a RISC-V compiler set (https://xpack.github.io/riscv-none-embed-gcc/) that can do just that and works fine for most things. The problem starts when I want to work with global variables, global arrays, etc., because these get copied to RAM on boot/reset by the start script.
Here is a block diagram of my CPU: (This will be important later. Just note that the Instruction memory = FLASH and Data memory = RAM)
(If you are interested in my CPU, I made a video about it: https://www.youtube.com/watch?v=KzSaFFpBPDM)
Example:
A typical program will look something like this:
#include <stdint.h>
int static_var_1 = 2;
int static_var_2 = 4;
int main(void)
{
int var = static_var_1 + static_var_2;
}
And its objdump something like this:
/opt/xpack-riscv-none-embed-gcc-10.1.0-1.1/riscv-none-embed/bin/objdump build/APP.elf -D
build/APP.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <_start>:
0: 00080137 lui sp,0x80
4: ffc10113 addi sp,sp,-4 # 7fffc <_estack>
8: 00c000ef jal ra,14 <main>
c: 0040006f j 10 <_exit>
00000010 <_exit>:
10: 0000006f j 10 <_exit>
00000014 <main>:
14: fe010113 addi sp,sp,-32
18: 00812e23 sw s0,28(sp)
1c: 02010413 addi s0,sp,32
20: 00002703 lw a4,0(zero) # 0 <_start>
24: 00402783 lw a5,4(zero) # 4 <static_var_2>
28: 00f707b3 add a5,a4,a5
2c: fef42623 sw a5,-20(s0)
30: 00000793 li a5,0
34: 00078513 mv a0,a5
38: 01c12403 lw s0,28(sp)
3c: 02010113 addi sp,sp,32
40: 00008067 ret
Disassembly of section .data:
00000000 <static_var_1>:
0: 0002 c.slli64 zero
...
00000004 <static_var_2>:
4: 0004 0x4
...
Disassembly of section ._user_heap_stack:
00000008 <._user_heap_stack>:
...
(<_start> is part of my start script; it initializes the stack pointer)
The catch:
These are the two instructions that try to load the global variables:
20: 00002703 lw a4,0(zero) # 0 <_start>
24: 00402783 lw a5,4(zero) # 4 <static_var_2>
But there is a problem - they were never put into RAM, so the CPU will most likely end up with some garbage data, which is unacceptable.
The solution?
Somebody suggested linker relaxation in my previous question (RISC-V: Global variables), but that doesn't seem to be the cause, though I could be wrong.
From my research, most of the "classic" CPUs use a start script where the copying takes place, but as this is not a von Neumann architecture, the FLASH memory is not mapped into data memory and therefore cannot be read by the program (see the block diagram above). The output program must contain the variables already encoded as executable instructions. For example, if we want the value 0x4 in RAM at position 0x0, it can be encoded as:
addi t0, zero, 0x4
sw t0, 0(zero)
Rebuilding my CPU as von Neumann would require many more gates and ICs, and this is a discrete build where every IC counts.
Doing it in hardware is for me the worst solution, as I stated above, so if it can be done in software I'm all for it, and it can! There is an obvious solution, but it is by far the ugliest: compile the code, extract the data (with Python), generate a new startup script with these variables encoded as instructions by the Python script, and compile it again.
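A minimal sketch of the generator step in that workaround, written in C rather than Python. The function name emit_word_init and the use of t0/t1 as scratch registers are assumptions made for illustration; they are not part of the toolchain or of the start script shown above. It only formats the instructions for one word, so a real generator would still have to extract the .data contents from the ELF file itself.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of any toolchain: writes the assembly text
   that stores one 32-bit initial value at a RAM address using only
   immediates, so nothing has to be read back from instruction memory.
   Returns the snprintf result (characters needed, excluding the NUL). */
int emit_word_init(char *out, size_t out_size, uint32_t addr, uint32_t value)
{
    return snprintf(out, out_size,
                    "    li t0, 0x%08" PRIx32 "\n"   /* value to store */
                    "    li t1, 0x%08" PRIx32 "\n"   /* RAM address    */
                    "    sw t0, 0(t1)\n",
                    value, addr);
}
```

Calling it once per word of .data (for example address 0x4 with value 4 for static_var_2) would yield a block that could be placed in the start script before the jump to main.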
I really don't want to go that route, so if it can be done by modifying startup script, linker, etc, it would be really, really great.
AVR ICs are basically Harvard architecture (though modified), so do they do something differently that we can learn from?

Related

What %lo(source)($6) and .frame mean in assembly code?

I compiled a simple C program to MIPS assembly and am trying to understand the assembly code. By comparing it with the C code, I almost understand it, but I still have some problems.
I use mips-gcc to generate assembly code: $ mips-gcc -S -O2 -fno-delayed-branch -I/usr/include lab3_ex3.c -o lab3_ex3.s
Here is my guess about how the assembly code works:
main is the entry of the program.
$6 is the address of source array.
$7 is the address of dest array.
$3 is the size of source array.
$2 is the variable k and is initialized to 0.
$L3 is the loop
$5 and $4 are addresses of source[k] and dest[k].
sw $3,0($5) is equivalent to store source[k] in $3.
lw $3,4($4) is equivalent to assign source[k] to dest[k].
addiu $2,$2,4 is equivalent to k++.
bne $3, $0, $L3 means that if source[k] is zero the loop exits, otherwise it jumps to label $L3.
$L2 just do some clean up work.
Set $2 to zero.
Jump to $31 (return address).
My problems are:
What does .frame $sp,0,$31 do?
Why lw $3,4($4) instead of lw $3,0($4)?
What does the notation %lo(source)($6) mean? (The $hi and $lo registers are used in multiplication, so why are they used here?)
Thanks.
C
int source[] = {3, 1, 4, 1, 5, 9, 0};
int dest[10];
int main ( ) {
int k;
for (k=0; source[k]!=0; k++) {
dest[k] = source[k];
}
return 0;
}
Assembly
.file 1 "lab3_ex3.c"
.section .mdebug.eabi32
.previous
.section .gcc_compiled_long32
.previous
.gnu_attribute 4, 1
.text
.align 2
.globl main
.set nomips16
.ent main
.type main, #function
main:
.frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0
.mask 0x00000000,0
.fmask 0x00000000,0
lui $6,%hi(source)
lw $3,%lo(source)($6)
beq $3,$0,$L2
lui $7,%hi(dest)
addiu $7,$7,%lo(dest)
addiu $6,$6,%lo(source)
move $2,$0
$L3:
addu $5,$7,$2
addu $4,$6,$2
sw $3,0($5)
lw $3,4($4)
addiu $2,$2,4
bne $3,$0,$L3
$L2:
move $2,$0
j $31
.end main
.size main, .-main
.globl source
.data
.align 2
.type source, #object
.size source, 28
source:
.word 3
.word 1
.word 4
.word 1
.word 5
.word 9
.word 0
.comm dest,40,4
.ident "GCC: (GNU) 4.4.1"
Firstly, main, $L3 and $L2 are labels for 3 basic blocks. You are roughly correct about their functions.
Question 1: What is .frame doing
This is not a MIPS instruction. It is metadata describing the (stack) frame for this function. Its three operands are:
the frame register: the stack is pointed to by $sp, an alias for $29.
the size of the stack frame (0, since the function has neither local variables nor arguments on the stack). Further, the function is simple enough that it can work with scratch registers and does not need to save callee-saved registers $16-$23.
the register holding the return address ($31 in the MIPS calling convention).
For more information regarding the MIPS calling convention, see this doc.
Question 2: Why lw $3,4($4) instead of lw $3,0($4)
This is due to an optimization of the loop. Normally, the sequence of loads and stores would be :
load source[0]
store dest[0]
load source[1]
store dest[1]
....
You assume that the loop is entirely in $L3, and that contains load source[k] and store dest[k]. It isn't. There are two clues to see this:
There is a load in the block main that does not correspond to any load outside the loop in the C code.
Within the basic block $L3, the store is before the load.
In fact, load source[0] is performed in the basic-block named main. Then, the loop in the basic block $L3 is store dest[k];load source[k+1];. Therefore, the load uses an offset of 4 more than the offset of the store, because it is loading the integer for the next iteration.
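The rotated loop can be sketched back in C. This is only an illustration of the shape the compiler produced, with the function name copy_rotated chosen here for the example; the instruction comments map each statement to the listing in the question.

```c
/* Copies words from source to dest up to (not including) the terminating 0,
   arranged like the compiled code: the load for the next iteration happens
   at the end of the current one. Returns the number of words copied. */
int copy_rotated(const int *source, int *dest)
{
    int k = 0;
    int value = source[0];        /* lw $3,%lo(source)($6), in block main  */
    while (value != 0) {          /* beq before the loop, bne at the bottom */
        dest[k] = value;          /* sw $3,0($5)                            */
        value = source[k + 1];    /* lw $3,4($4): one word ahead            */
        k++;                      /* addiu $2,$2,4 (a byte offset in asm)   */
    }
    return k;
}
```

Written this way, the store uses index k while the load uses k + 1, which is the 4-byte difference in offsets seen in $L3.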
Question 3: What is the lo/hi syntax?
This has to do with instruction encodings and pointers. Let us assume a 32-bit architecture, i.e. a pointer is 32 bits. Like most fixed-size instruction ISAs, let us assume that the instruction size is also 32 bits.
Before loading and storing from the source/dest arrays, you need to load their pointers into registers $6 and $7 respectively. Therefore, you need an instruction to load a 32-bit constant address into a register. However, a 32-bit instruction must contain a few bits to encode opcodes (which operation the instruction is), destination register etc. Therefore, an instruction has less than 32 bits left to encode constants (called immediates). Therefore, you need two instructions to load a 32-bit constant into a register, each loading 16 bits. The lo/hi refer to which half of the constant is loaded.
Example: Assume that dest is at address 0xabcd1234. There are two instructions to load this value into $7.
lui $7,%hi(dest)
addiu $7,$7,%lo(dest)
lui is Load Upper immediate. It loads the top 16 bits of the address of dest (0xabcd) into the top 16 bits of $7. Now, the value of $7 is 0xabcd0000.
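addiu is Add Immediate Unsigned. It adds the lower 16 bits of the address of dest (0x1234) to the existing value in $7 to get the new value of $7. Thus, $7 now holds 0xabcd0000 + 0x1234 = 0xabcd1234, the address of dest. (Despite the name, addiu sign-extends its 16-bit immediate, so when bit 15 of the low half is set, %hi yields a value one higher to compensate.)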
Similarly, lw $3,%lo(source)($6) loads from the address pointed to by $6 (which already holds the top 16 bits of the address of source) at an offset of %lo(source) (the bottom 16 bits of that address). Effectively, it loads the first word of source.
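The split and recombination can be sketched in C. The helper names hi16, lo16 and lui_addiu are made up for this example; the arithmetic models how the two halves relate when the low half is sign-extended by addiu or by the offset of lw.

```c
#include <stdint.h>

/* Upper half as %hi produces it: rounded so that adding the
   sign-extended lower half gives back the full address. */
uint32_t hi16(uint32_t addr) { return ((addr + 0x8000u) >> 16) & 0xFFFFu; }

/* Lower half as %lo produces it. */
uint32_t lo16(uint32_t addr) { return addr & 0xFFFFu; }

/* Models lui followed by addiu with the given 16-bit halves. */
uint32_t lui_addiu(uint32_t hi, uint32_t lo)
{
    uint32_t reg = hi << 16;                                   /* lui   */
    uint32_t sext = (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;  /* sign extend */
    return reg + sext;                          /* addiu, wraps modulo 2^32 */
}
```

For 0xabcd1234 the halves are 0xabcd and 0x1234 as in the example above. For an address such as 0xabcd9234 the low half has bit 15 set, so hi16 gives 0xabce and the sum still comes out as 0xabcd9234.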

How to use TBB instruction (Cortex-M3) with gnu assembler?

Section 3.10.4 of the Arm generic user guide (page 172) gives an example for using TBB, but the example uses Arm assembler. I would like to learn how to use TBB with gas, but can't seem to figure out how. How should I revise the example from the guide to implement a switch statement with gas instead of armasm?
ADR.W R0, BranchTable_Byte
TBB [R0, R1] ; R1 is the index, R0 is the base address of the
; branch table
Case1
; an instruction sequence follows
Case2
; an instruction sequence follows
Case3
; an instruction sequence follows
BranchTable_Byte
DCB 0 ; Case1 offset calculation
DCB ((Case2-Case1)/2) ; Case2 offset calculation
DCB ((Case3-Case1)/2) ; Case3 offset calculation
I'm new to using gas and am not sure if I should be defining the branch table in a .data section at the beginning of the assembler file or if it should go after my switch statement in the .text section.
.cpu cortex-m3
.thumb
.syntax unified
ADR.W R0, BranchTable_Byte
TBB [R0, R1] #; R1 is the index, R0 is the base address of the
#; branch table
Case1:
#; an instruction sequence follows
nop
Case2:
#; an instruction sequence follows
nop
nop
Case3:
#; an instruction sequence follows
nop
nop
nop
BranchTable_Byte:
.byte 0 #; Case1 offset calculation
.byte ((Case2-Case1)/2) #; Case2 offset calculation
.byte ((Case3-Case1)/2) #; Case3 offset calculation
Something like this, perhaps. Labels need colons. Sadly, ; is not a comment anymore, # is. Got lucky on the label math.
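To see why the table entries are (CaseN-Case1)/2, here is a small C model of what TBB computes. It assumes the usual Thumb rule that the PC reads as the address of the TBB instruction plus 4, which is where Case1 sits because TBB is a 32-bit instruction. The function names and any addresses passed to them are illustrative, not taken from a real build.

```c
#include <stdint.h>

/* Branch target computed by TBB: PC + 2 * table[index],
   where PC reads as the TBB instruction's address + 4. */
uint32_t tbb_target(uint32_t tbb_addr, const uint8_t *table, uint32_t index)
{
    uint32_t pc = tbb_addr + 4u;
    return pc + 2u * table[index];
}

/* Table entry for a case label: halfword distance from the first case,
   which must directly follow the TBB instruction. */
uint8_t tbb_entry(uint32_t case_addr, uint32_t first_case_addr)
{
    return (uint8_t)((case_addr - first_case_addr) / 2u);
}
```

Because the offsets are relative to the instruction stream, the table works from the .text section right after the cases, as in the listing above; each entry is one byte, so a case can be at most 510 bytes past the first one.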

ARM PC value after Reset

I am new to MCUs and am trying to figure out how an Arm (Cortex-M3/M4) based MCU boots. Because booting is specific to each SoC, I took an STM hardware board as a case study.
Board: STMicroelectronics – STM32L476 32-bit.
On this board, when the boot mode is (x0) "Boot from User Flash", the board maps address 0x00000000 to the flash memory address. In flash memory I have placed my binary, whose first 4 bytes are the first entry of the vector table, which is esp (the initial stack pointer). Now if I press the reset button, the ARM documentation says the PC value will be set to 0x00000000.
A CPU generally executes a stream of instructions in a PC -> PC + 1 loop. In this case the PC value points to esp, which is not an instruction. How does the Arm CPU know not to execute from this address, and instead jump to the value stored at address 0x00000004?
Or is this the case:
Reset produces a special hardware interrupt and causes the PC to be set to the value at 0x00000004. If this is the case, why does the Arm documentation say it sets the PC value to 0x00000000?
Ref: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3761.html
What values are in ARM registers after a power-on reset? Applies to:
ARM1020/22E, ARM1026EJ-S, ARM1136, ARM720T, ARM7EJ-S, ARM7TDMI,
ARM7TDMI-S, ARM920/922T, ARM926EJ-S, ARM940T, ARM946E-S, ARM966E-S,
ARM9TDMI
Answer Registers R0 - R14 (including banked registers) and SPSR (in
all modes) are undefined after reset.
The Program Counter (PC/R15) will be set to 0x000000, or 0xFFFF0000 if
the core has a VINITHI or CFGHIVECS input which is set high as the
core leaves reset. This input should be set to reflect where the base
of the vector table in your system is located.
The Current Program Status Register (CPSR) will indicate that the ARM
core has started in ARM state, Supervisor mode with both FIQ and IRQ
mask bits set. The condition code flags will be undefined. Please see
the ARM Architecture Manual for a detailed description of the CPSR.
The cortex-m's do not boot the same way the traditional, full sized cores boot. Those, at least for reset as you pointed out, fetch their first instructions from address 0x00000000 (or the alternate if asserted). It is not really fair to call it the PC value, as at this point the PC is somewhat bogus: there are multiple program counters being produced, a fake one in r15, one leading the fetching, one doing prefetch, and none are really the program counter. Anyway, doesn't matter.
The cortex-m boot is documented in the armv7-m documentation (for the m3 and m4; for the m0 and m0+ see the armv6-m, although so far they all boot the same way). These use a vector TABLE, not instructions. The CORE reads address 0x00000000 (or an alternate if a strap is asserted) and that 32 bit value gets loaded into the stack pointer register. It then reads address 0x00000004 and checks the lsbit (maybe not all cores do); if set, this is a valid thumb address, so it strips the lsbit off (makes it a zero) and begins to fetch the first instructions for the reset handler at that address. So if your flash starts with
0x00000000 : 0x20001000
0x00000004 : 0x00000101
the cortex-m will put 0x20001000 in the stack pointer and fetch the first instructions from address 0x100. Thumb instructions are 16 bits, with thumb2 extensions being two 16 bit portions. It's not an x86: the program counter is aligned. The full sized processors with 32 bit instructions fetch on aligned addresses 0x0000, 0x0004, 0x0008; they don't increment pc <= pc + 1. For thumb mode or thumb processors it is pc = pc + 2. But also the fetches are not necessarily single instruction transactions: the full sized cores may fetch 4 or 8 words per transaction, and some cortex-ms, as documented in their technical reference manuals, can be compiled or strapped to fetch 16 bits at a time or 32 bits at a time. So no need to talk about or think about execution loops fetching pc = pc + 1; that doesn't make sense even in an x86 these days.
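As a rough model of that reset sequence in C (hardware logic, not code the core runs), assuming the two-word table start described above. The struct and function names are invented for this sketch.

```c
#include <stdint.h>

/* What the core holds after reset in this simplified model. */
typedef struct {
    uint32_t sp;     /* initial stack pointer, from word 0           */
    uint32_t pc;     /* first fetch address, word 1 with bit 0 clear */
    int      valid;  /* nonzero if word 1 had the thumb bit set      */
} reset_state;

/* Models the reset logic reading the first two vector table words. */
reset_state cortex_m_reset(const uint32_t *vector_table)
{
    reset_state s;
    s.sp    = vector_table[0];
    s.valid = (vector_table[1] & 1u) != 0;
    s.pc    = vector_table[1] & 0xFFFFFFFEu;   /* strip the lsbit */
    return s;
}
```

With the words 0x20001000 and 0x00000101 this gives a stack pointer of 0x20001000 and a first fetch from 0x100, matching the description above.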
to be fair arms documentation is generally good, on the better side compared to a number of others, not the best. Unlike the full sized arm exception table, the vector table in the cortex-m documentation was not done as well as it could have been, could have/should have just done something like the full sized but shown they were vectors not instructions. It is in there though in the architectural reference manual for the armv6-m and armv7-m (and I would assume armv8-m as well but have not looked, got some parts last week but boards are not here yet, will know very soon). Cant look for words like reset have to look for interrupt or undefined or hardfault, etc in that manual.
EDIT
unwrap your mind on this notion of how the processor starts fetching, it can be any arbitrary address they add into the design, and then the execution of the instructions determines the next address and next address, etc.
Also understand unlike say x86 or microchip pic or the avrs, etc, the core and the chips are two different companies. Even in those same company designs, but certainly where there is a clear division between the IP with a known bus, the ARM CORE will read address 0x00000004 on the AMBA/AXI/AHB bus, the chip vendor can mirror that address in as many different places as they want, in this case with the stm32 there probably isnt actually anything at 0x00000000 as their documentation implies based on the boot pins they map it either to an internal bootloader, or they map it to the user application at 0x08000000 (or in most stm32's if there is an exception thats fine I have not yet seen it) so when strapped that way and the logic has those addresses mirrored you will see the same 32 bit values at 0x00000000 and 0x08000000, 0x00000004 and 0x08000004 and so on for some limited amount of address space. This is why even though linking for 0x00000000 will work to some extent (till you hit that limit which is probably smaller than the application flash size), you will see most folks link for 0x08000000 and the hardware takes care of the rest, so your table really wants to look like
0x08000000 : 0x20001000
0x08000004 : 0x08000101
for an stm32, at least the dozens I have seen so far.
The processor reads 0x00000000, which is mirrored to the first item in the application flash, and finds 0x20001000. It then reads 0x00000004, which is mirrored to the second word in the application flash, and gets 0x08000101, which causes a fetch from 0x08000100, and now we are executing from the proper fully mapped application flash address space. That holds so long as you dont change the mirroring, which I dont know if you can on an stm32 (nxp chips you can and I dont know about ti or other brands off hand). On some of the cortex-m cores the VTOR register is there and changeable (on others it is fixed at 0x00000000 and you cant change it); you do not need to change it to 0x08000000 for an stm32, at least all the ones I know about. Its only if you are actively changing the mirroring of the zero address space yourself, if possible, or if you say have your own bootloader and maybe YOUR application space is 0x08004000 and that application wants a vector table of its own. Then you either use VTOR or you build the bootloaders vector table such that it runs code that reads the vectors at 0x08004000 and branches to those. NXP and others in the past, certainly with the ARM7TDMI cores, would let you change the mirroring of address zero because those older cores didnt have a programmable vector table offset register, helping you solve that problem in their chip designs. Newer ARM cores with a VTOR eliminate that need and over time the chip vendors might not bother anymore if they do at all...
EDIT
I dont know if you have the discovery board or the nucleo, I assume the latter as the former is not available (wish I knew about that one would like to have one. And/or I already have one and its buried in a drawer and I never got to it).
so here is a somewhat minimal program you can try on your stm32
.cpu cortex-m0
.thumb
.globl _start
_start:
.word 0x20000400
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,=0x20000000
mov r2,sp
str r2,[r0]
add r0,r0,#4
mov r2,pc
str r2,[r0]
add r0,r0,#4
mov r1,#0
top:
str r1,[r0]
add r1,r1,#1
b top
build
arm-none-eabi-as so.s -o so.o
arm-none-eabi-ld -Ttext=0x08000000 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary so.bin
this should build with arm-linux-whatever- or other arm-whatever-whatever tools from a binutils from the last 10 years.
The disassembly is important to examine before using the binary, dont want to brick your chip (with an stm32 there is a way to get unbricked)
08000000 <_start>:
8000000: 20000400 andcs r0, r0, r0, lsl #8
8000004: 08000013 stmdaeq r0, {r0, r1, r4}
8000008: 08000011 stmdaeq r0, {r0, r4}
800000c: 08000011 stmdaeq r0, {r0, r4}
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
8000012: 4805 ldr r0, [pc, #20] ; (8000028 <top+0x6>)
8000014: 466a mov r2, sp
8000016: 6002 str r2, [r0, #0]
8000018: 3004 adds r0, #4
800001a: 467a mov r2, pc
800001c: 6002 str r2, [r0, #0]
800001e: 3004 adds r0, #4
8000020: 2100 movs r1, #0
08000022 <top>:
8000022: 6001 str r1, [r0, #0]
8000024: 3101 adds r1, #1
8000026: e7fc b.n 8000022 <top>
8000028: 20000000 andcs r0, r0, r0
the disassembler doesnt know that the vector table is not instructions so you can ignore those.
08000000 <_start>:
8000000: 20000400
8000004: 08000013
8000008: 08000011
800000c: 08000011
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
Does it start the vector table at 0x08000000, check. Our stack pointer init value is at 0x00000000, yes, the reset vector we had the tools place for us. thumb_func tells them the following label is an address for some code/function/procedure/whatever_not_data so they orr the one on there for us. our reset handler is at address 0x08000012 so we want to see 0x08000013 in the vector table, check. I tossed in a couple more for demonstration purposes, sent them to an infinite loop at address 0x08000010 so the vector table should have 0x08000011, check.
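The "orr the one on there" step is simple enough to write down. A tiny sketch in C, with a function name made up for this example:

```c
#include <stdint.h>

/* Vector table entry for a thumb handler: the handler's address
   with the lsbit set, which is what .thumb_func arranges. */
uint32_t thumb_vector_entry(uint32_t handler_addr)
{
    return handler_addr | 1u;
}
```

So the reset handler at 0x08000012 shows up in the table as 0x08000013, and the loop at 0x08000010 as 0x08000011, which are the values to look for in the disassembly before flashing.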
So assuming you have a nucleo board not the discovery then you can copy the so.bin file to the thumb drive that shows up when you plug it in.
If you use openocd to connect through the stlink interface into the board now you can see that it was running (details left to the reader to figure out)
Open On-Chip Debugger
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 0048cd01 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
> resume
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 005e168c 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
so we can see that the stack pointer had 0x20000400 as expected
0x20000000: 20000400 0800001e 0048cd01
the program counter which is not some magical thing, they have to somewhat fake it to make the instruction set work.
800001a: 467a mov r2, pc
as defined in the instruction set the pc value used in this instruction is two instructions ahead of the address of this instruction, so 0x0800001A + 4 = 0x0800001E which is what we see in the memory dump.
And the third item is a counter showing we are running, the resume and halt shows that that count kept going
0x20000000: 20000400 0800001e 005e168c
So this demonstrates, the vector table, initializing the stack pointer, the reset vector, where code execution starts, what the value of the pc is at some point in the program, and seeing the program run.
the .cpu cortex-m0 makes it build the most compatible program for the cortex-m family, and the ldr r0,=0x20000000 was cheating. You posted the same feature in your comment: it says I want to load the address of blah into the register. A label is just an address, and they let you put just an address: =_estack is the address of a label, =0x20000000 is just a number treated as an address (addresses are just numbers as well, nothing magical about them). I could have done a smaller immediate with a shift, or explicitly have done the pc relative load. Force of habit in this case.
EDIT2
In attempt for a programmer to understand that the chip is logic, only some percentage of it is software/instruction driven, even within that it is just logic that does more things than the software instruction itself indicates. You want to read from memory your instruction asks the processor to do it but in a real chip there are a number of steps involved to actually perform that, microcoded or not (ARMs are not microcoded) there are state machines that walk through the various steps to perform each of these tasks. grab the values from registers, compute the address, do the memory transaction which is a handful of separate steps, take the return value and place it in the register file.
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,loop_counts
loop_top:
sub r0,r0,#1
bne loop_top
b reset
.align
loop_counts: .word 0x1234
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 00000013 andeq r0, r0, r3, lsl r0
8: 00000011 andeq r0, r0, r1, lsl r0
c: 00000011 andeq r0, r0, r1, lsl r0
00000010 <loop>:
10: e7fe b.n 10 <loop>
00000012 <reset>:
12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
00000014 <loop_top>:
14: 3801 subs r0, #1
16: d1fd bne.n 14 <loop_top>
18: e7fb b.n 12 <reset>
1a: 46c0 nop ; (mov r8, r8)
0000001c <loop_counts>:
1c: 00001234 andeq r1, r0, r4, lsr r2
Just barely enough of an instruction set simulator to run that program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ROMMASK 0xFFFF
#define RAMMASK 0xFFF
unsigned short rom[ROMMASK+1];
unsigned short ram[RAMMASK+1];
unsigned int reg[16];
unsigned int pc;
unsigned int cpsr;
unsigned int inst;
int main ( void )
{
unsigned int ra;
unsigned int rb;
unsigned int rc;
unsigned int rx;
//just putting something there, a real chip might have an MBIST, might not.
memset(reg,0xBA,sizeof(reg));
memset(ram,0xCA,sizeof(ram));
memset(rom,0xFF,sizeof(rom));
//in a real chip the rom/flash would contain the program and not
//need to do anything to it, this sim needs to have the program
//various ways to have done this...
//00000000 <_start>:
rom[0x00>>1]=0x1000; // 0: 20001000 andcs r1, r0, r0
rom[0x02>>1]=0x2000;
rom[0x04>>1]=0x0013; // 4: 00000013 andeq r0, r0, r3, lsl r0
rom[0x06>>1]=0x0000;
rom[0x08>>1]=0x0011; // 8: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0A>>1]=0x0000;
rom[0x0C>>1]=0x0011; // c: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0E>>1]=0x0000;
//
//00000010 <loop>:
rom[0x10>>1]=0xe7fe; // 10: e7fe b.n 10 <loop>
//
//00000012 <reset>:
rom[0x12>>1]=0x4802; // 12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
//
//00000014 <loop_top>:
rom[0x14>>1]=0x3801; // 14: 3801 subs r0, #1
rom[0x16>>1]=0xd1fd; // 16: d1fd bne.n 14 <loop_top>
rom[0x18>>1]=0xe7fb; // 18: e7fb b.n 12 <reset>
rom[0x1A>>1]=0x46c0; // 1a: 46c0 nop ; (mov r8, r8)
//
//0000001c <loop_counts>:
rom[0x1C>>1]=0x0004; // 1c: 00001234 andeq r1, r0, r4, lsr r2
rom[0x1E>>1]=0x0000;
//reset
//THIS IS NOT SOFTWARE DRIVEN LOGIC, IT IS JUST LOGIC
ra=rom[0x00>>1];
rb=rom[0x02>>1];
reg[13]=(rb<<16)|ra; //initial stack pointer goes in r13 (SP), r14 is the link register
ra=rom[0x04>>1];
rb=rom[0x06>>1];
rc=(rb<<16)|ra;
if((rc&1)==0) return(1); //normally run a fault handler here
pc=rc&0xFFFFFFFE;
reg[15]=pc+2;
cpsr=0x000000E0;
//run
//THIS PART BELOW IS SOFTWARE DRIVEN LOGIC
//still you can see that each instruction requires some amount of
//non-software driven logic.
//while(1)
for(rx=0;rx<20;rx++)
{
inst=rom[(pc>>1)&ROMMASK];
printf("0x%08X : 0x%04X\n",pc,inst);
reg[15]=pc+4;
pc+=2;
if((inst&0xF800)==0x4800)
{
//LDR
printf("LDR r%02u,[PC+0x%08X]",(inst>>8)&0x7,(inst&0xFF)<<2);
ra=(inst>>0)&0xFF;
rb=reg[15]&0xFFFFFFFC;
ra=rb+(ra<<2);
printf(" {0x%08X}",ra);
rb=rom[((ra>>1)+0)&ROMMASK];
rc=rom[((ra>>1)+1)&ROMMASK];
ra=(inst>>8)&0x07;
reg[ra]=(rc<<16)|rb;
printf(" {0x%08X}\n",reg[ra]);
continue;
}
if((inst&0xF800)==0x3800)
{
//SUB
ra=(inst>>8)&0x07;
rb=(inst>>0)&0xFF;
printf("SUBS r%u,%u ",ra,rb);
rc=reg[ra];
rc-=rb;
reg[ra]=rc;
printf("{0x%08X}\n",rc);
//do flags
if(rc==0) cpsr|=0x80000000; else cpsr&=(~0x80000000); //Z flag (kept in bit 31 by this sim; the real Z flag is bit 30)
//dont need other flags for this example
continue;
}
if((inst&0xF000)==0xD000) //B conditional
{
if(((inst>>8)&0xF)==0x1) //NE
{
ra=(inst>>0)&0xFF;
if(ra&0x80) ra|=0xFFFFFF00;
rb=reg[15]+(ra<<1);
printf("BNE 0x%08X\n",rb);
if((cpsr&0x80000000)==0)
{
pc=rb;
}
continue;
}
}
if((inst&0xF000)==0xE000) //B
{
ra=(inst>>0)&0x7FF;
if(ra&0x400) ra|=0xFFFFF800;
rb=reg[15]+(ra<<1);
printf("B 0x%08X\n",rb);
pc=rb;
continue;
}
printf("UNDEFINED INSTRUCTION 0x%08X: 0x%04X\n",pc-2,inst);
break;
}
return(0);
}
You are welcome to hate my coding style, this is a brute force thrown together for this question thing. No I dont work for ARM, this can all be pulled from public documents/information. I shortened the loop to 4 counts to see it hit the outer loop
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
Perhaps this helps perhaps this makes it worse. Most of the logic is not driven by instructions, each instruction, requires some amount of logic not counting the common logic like instruction fetching and things like that.
If you add more code this simulator will break it ONLY supports these handful of instructions and this loop.
The most important thing to check when you're confused about some behaviour of an Arm processor is probably to check the version of the architecture which applies. You will find a huge amount of very old legacy documentation which relates to ARM7 and ARM9 designs. Whilst not all of this is wrong today, it can be very misleading.
ARM v4, ARM v5, ARM v6: These are legacy designs, rarely even used in derivative products now.
ARM v7A: These are the first of the Cortex series. Cortex-A5 is the entry-level for a linux class device in 2018.
ARM v7M, ARM v6M: These are the common microcontroller devices like your STM32, and already these have over 10 years of history
ARM v8A: These introduce the 64 bit instruction set (T32/A32/A64 in one device), already entry level in the R-pi 3 for example.
ARM v8M: The latest iteration of a microcontroller architecture with more advanced security features, just starting to become available 2018Q2
Specifically, ARMv6M/ARMv7M/ARMv8M provide a very different exception model compared with all of the other ARM architectures (remaining similar within the family), whilst many of the other differences are more incremental or focused on specialised areas.

Uninitialized, writable data before data segment

I'm writing a simple program that converts brainfuck code into x86_64 assembly. Part of that involves creating a large zero-initialized array at the beginning of the program. Thus, each compiled program starts with the following assembly code:
.data
ARR:
.space 32430
.text
.globl _start
.type _start, #function
_start:
... #code as compiled from the brainfuck program
...
From there the compiled program is supposed to be able to access any part of that array, but it should segfault if it tries to access memory before or after it.
Because the array is followed directly by a .text section, which by my understanding is read only, and because it is the first section of the program, I expected that my desired behavior would follow naturally. Unfortunately, this is not the case: compiled programs are able to access non-zero initialized data to the left of (that is, at lower addresses than) the beginning of the array.
Why is this the case and is there anything I can include in the assembly code that would prevent it?
This is, of course, highly system-dependent, but since your observations suit a typical Linux/GNU system, I'll refer to such a system.
what I assume is that the linker isn't putting my segments where I think it is.
True, the linker puts the segments not in the order they appear in your code snippet, but rather .text first, .data second. We can see this e. g. with
> objdump -h ARR
ARR: file format elf32-i386
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000042 08048074 08048074 00000074 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00007eae 080490b8 080490b8 000000b8 2**2
CONTENTS, ALLOC, LOAD, DATA
compiled programs are able to access non-zero initialized data to the left of (that is, at lower addresses than) the beginning of the array.
Why is this the case …
As we also see in the above example, the .data section is linked at memory address 080490b8. Although memory pages have the length PAGE_SIZE (here getconf PAGE_SIZE yields 4096, i.e. 0x1000) and start at multiples of that size, the data starts at an address offset equal to the file offset 000000b8 (where the data is stored in the disk file), because the file pages containing the .data section are mapped into memory as copy-on-write pages. The non-zero initialized data below the .data section is just what happens to be in the first file page at bytes 0 to 0xb7, including .text.
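The arithmetic behind this can be sketched in C, assuming the 4096-byte page size reported above. The macro and function names are chosen for this example only.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed, as reported by getconf above */

/* Start address of the page that contains addr. */
uint32_t page_base(uint32_t addr)
{
    return addr & ~(SKETCH_PAGE_SIZE - 1u);
}

/* Offset of addr within its page. */
uint32_t page_offset(uint32_t addr)
{
    return addr & (SKETCH_PAGE_SIZE - 1u);
}
```

For the .data address 0x080490b8 the page starts at 0x08049000 and the offset is 0xb8, so the 0xb8 bytes below ARR belong to the same mapped, writable page and can be accessed without a fault.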
… is there anything I can include in the assembly code that would prevent it?
I'd prefer a solution that places my segments such that a bad array access causes a segfault.
As Margaret Bloom and Ped7g hinted at, you could allocate additional data below ARR and create an inaccessible guard page. This can be achieved with minimal effort by aligning ARR to the next page address. The example program below implements this and lets you test it: it accepts an index argument (optionally negative) with which the ARR data is accessed; if the index is within bounds, the program should exit with status 0, otherwise it should segfault. Note: This method works only if the .text section does not end at a page boundary, because if it does, the .align 4096 has no effect; but since the assembly code is created with a converter program, that program should be able to check this and add a few extra .text bytes if needed.
.data
.align 4096
ARR:
.space 30000 # we'll actually get 32768
.text
.globl _start
.type _start, #function
_start:
mov (%esp),%ebx # argc
cmp $1,%ebx
jbe 9f
mov $0,%ax
mov $1,%ebx # sign 1
mov 8(%esp),%esi # argv[1]
0: movb (%esi),%cl # convert argument string to integer
jcxz 1f
sub $'0',%cl
js 2f
mov $10,%dx
mul %dx
add %cx,%ax
jmp 3f
2: neg %ebx # change sign
3: add $1,%esi
jmp 0b
1: mul %ebx # multiply with sign 1 or -1
movzx ARR(%eax),%ebx # load ARR[atoi(argv[1])]
9: mov $1,%eax
int $128 # _exit(ebx);

Is my MIPS compiler crazy, or am I crazy for choosing MIPS?

I am using a MIPS CPU (PIC32) in an embedded project, but I am starting to question my choice.
I understand that a RISC CPU like MIPS will generate more instructions than one might expect, but I didn't think it would be like this. Here is a snippet from the disassembly listing:
225: LATDSET = 0x0040;
sw s1,24808(s2)
sw s4,24808(s2)
sw s4,24808(s2)
sw s1,24808(s2)
sw s4,24808(s3)
sw s4,24808(s3)
sw s1,24808(s3)
226: {
227: porte = PORTE;
lw t1,24848(s4)
andi v0,t1,0xffff
lw v1,24848(s6)
andi ra,v1,0xffff
lw v1,24848(s6)
andi ra,v1,0xffff
lw v0,24848(s6)
andi t2,v0,0xffff
lw a2,24848(s5)
andi v1,a2,0xffff
lw t2,24848(s5)
andi v1,t2,0xffff
lw v0,24848(s5)
andi t2,v0,0xffff
228: if (porte & 0x0004)
andi t2,v0,0x4
andi s8,ra,0x4
andi s8,ra,0x4
andi ra,t2,0x4
andi a1,v1,0x4
andi a2,v1,0x4
andi a2,t2,0x4
229: pst_bytes_somi[0] |= sliding_bit;
or t3,t4,s0
xori a3,t2,0x0
movz t3,s0,a3
addu s0,t3,zero
or t3,t4,s1
xori a3,s8,0x0
movz t3,s1,a3
addu s1,t3,zero
or t3,t4,s1
xori a3,s8,0x0
movz t3,s1,a3
addu s1,t3,zero
or v1,t4,s0
xori a3,ra,0x0
movz v1,s0,a3
addu s0,v1,zero
or a0,t4,s2
xori a3,a1,0x0
movz a0,s2,a3
addu s2,a0,zero
or t3,t4,s2
xori a3,a2,0x0
movz t3,s2,a3
addu s2,t3,zero
or v1,t4,s0
xori a3,a2,0x0
movz v1,s0,a3
This seems like a crazy number of instructions for simple reading / writing and testing variables at fixed addresses. On a different CPU, I could probably get each C statement down to about 1..3 instructions, without resorting to hand-written asm. Obviously the clock rate is fairly high, but it's not 10x higher than what I would have in a different CPU (e.g. dsPIC).
I have optimisation set to maximum. Is my C compiler terrible (It's gcc 3.4.4)? Or is this typical of MIPS?
Finally figured out the answer. The disassembly listing is totally misleading. The compiler is doing loop unrolling, and what we're seeing under each C statement is actually 8x the number of instructions, because it's unrolling the loop 8x. The instructions are not at consecutive addresses! Turning off loop unrolling in the compiler options produces this:
225: LATDSET = 0x0040;
sw s3,24808(s2)
226: {
227: porte = PORTE;
lw t1,24848(s5)
andi v0,t1,0xffff
228: if (porte & 0x0004)
andi t2,v0,0x4
229: pst_bytes_somi[0] |= sliding_bit;
or t3,t4,s0
xori a3,t2,0x0
movz t3,s0,a3
addu s0,t3,zero
230:
Panic over everyone.
I think your compiler is misbehaving...
Check for example this statement:
228: if (porte & 0x0004)
andi t2,v0,0x4 (1)
andi s8,ra,0x4 (2)
andi s8,ra,0x4 (3)
andi ra,t2,0x4 (4)
andi a1,v1,0x4 (5)
andi a2,v1,0x4 (6)
andi a2,t2,0x4 (7)
It is obvious that there are instructions that basically do nothing. Instruction (3) does nothing new, as it stores in s8 the same result already computed by instruction (2).
Instruction (6) also has no effect, as its result is overwritten by the next instruction (7).
I believe any compiler with a static analysis phase would at least remove instructions (3) and (6).
A similar analysis applies to other portions of your code. For example, in the porte = PORTE statement you can see a register (v1, for instance) being loaded with the same value twice.
I think your compiler is not doing a good job at optimizing the compiled code.
MIPS is basically the embodiment of everything that was stupid about RISC design. These days x86 (and x86_64) have absorbed pretty much all the worthwhile ideas out of RISC, and ARM has evolved to be much more efficient than traditional RISC while still staying true to the RISC concept of keeping a small, systematic instruction set.
To answer the question, I'd say you're crazy for choosing MIPS, or perhaps more importantly, for choosing it without first learning a bit about the MIPS ISA and why it's so bad and how much inefficiency you need to put up with if you want to use it. I'd choose ARM for low-power/embedded systems in most situations, or better yet Intel Atom if you can afford a bit more power consumption.
Edit: Actually, a second reason you may be crazy... From the comments, it seems you're using 16-bit integers. You should never use smaller-than-int types in C except in arrays or in a structure that will be allocated in large numbers (either in an array or some other way such as a linked list/tree/etc.). Using small types will never give any benefit except for saving space (which is irrelevant until you have a large number of values of such type) and is almost surely less efficient than using "normal" types. In the case of MIPS, the difference is extreme. Switch to int and see if your problem goes away.
The only thing I can think of is that perhaps the compiler is injecting extra nonsense instructions to match the speed of the CPU to a much slower data bus. Even that explanation isn't quite sufficient, as the store / load instructions show similar redundancy.
Although the compiler is the suspect, don't forget that focusing all your effort on the compiler can give you tunnel vision. Errors may be latent in other parts of the tool chain too.
Where did you get the compiler? I find that some of the "easy" sources often ship some pretty horrible tools. Embedded development friends of mine typically compile their own tool chain with sometimes much better results.
I tried compiling the following code with CodeSourcery MIPS GCC 4.4-303 with -O4. I tried it with uint32_t and uint16_t:
#include <stdint.h>
void foo(uint32_t PORTE, uint32_t pst_bytes_somi[], uint32_t sliding_bit) {
uint32_t LATDSET = 0x0040;
{
uint32_t porte = PORTE;
if (porte & 0x0004)
pst_bytes_somi[0] |= sliding_bit;
if (porte & LATDSET)
pst_bytes_somi[1] |= sliding_bit;
}
}
Here is the disassembly with uint32_t integers:
uint32_t porte = PORTE;
if (porte & 0x0004)
0: 30820004 andi v0,a0,0x4
4: 10400004 beqz v0,18 <foo+0x18>
8: 00000000 nop
./foo32.c:7
pst_bytes_somi[0] |= sliding_bit;
c: 8ca20000 lw v0,0(a1)
10: 00461025 or v0,v0,a2
14: aca20000 sw v0,0(a1)
./foo32.c:8
if (porte & LATDSET)
18: 30840040 andi a0,a0,0x40
1c: 10800004 beqz a0,30 <foo+0x30>
20: 00000000 nop
./foo32.c:9
pst_bytes_somi[1] |= sliding_bit;
24: 8ca20004 lw v0,4(a1)
28: 00463025 or a2,v0,a2
2c: aca60004 sw a2,4(a1)
30: 03e00008 jr ra
34: 00000000 nop
Here is the disassembly with uint16_t integers:
if (porte & 0x0004)
4: 30820004 andi v0,a0,0x4
8: 10400004 beqz v0,1c <foo+0x1c>
c: 30c6ffff andi a2,a2,0xffff
./foo16.c:7
pst_bytes_somi[0] |= sliding_bit;
10: 94a20000 lhu v0,0(a1)
14: 00c21025 or v0,a2,v0
18: a4a20000 sh v0,0(a1)
./foo16.c:8
if (porte & LATDSET)
1c: 30840040 andi a0,a0,0x40
20: 10800004 beqz a0,34 <foo+0x34>
24: 00000000 nop
./foo16.c:9
pst_bytes_somi[1] |= sliding_bit;
28: 94a20002 lhu v0,2(a1)
2c: 00c23025 or a2,a2,v0
30: a4a60002 sh a2,2(a1)
34: 03e00008 jr ra
38: 00000000 nop
As you can see, each C statement maps into two to three instructions. Using 16-bit integers makes the function only one instruction longer.
Have you turned on compiler optimizations? Unoptimized code contains a lot of redundancy.

Resources