Interpreting jump tables / branch tables

Interpreting jump tables / branch tables - arm

I've been slowly picking things up with assembly. I am working on a Canon Rebel T1i, here is a small snippet of a code flow chart that I am trying to understand. To my knowledge, I believe the camera has a 132MHz ARM v5 processor:
http://i.imgur.com/PtWC9.png
I have searched the bottom of google attempting to understand how jump tables work, and no matter how much I read I just can't connect things together to understand it. I understand a jump table is similar to a case statement, but I don't understand just how it moves through the table.
Ex: in this example there is only one CMP operation, so I don't understand how exactly this is working. Any help will be greatly appreciated!!

I dont think you have enough info on the screen shot to understand how it connects to your question. But a jump table in general...
In C think of an array of functions, and you have initialized each element in the array of functions, at some point later your code makes some decision and uses an index to choose one of those functions. As you mentioned a case statement, could be implemented that way but that would be the exception not the rule, all depends on the variable being used in the switch and the size/width/nature of the elements in the case statement.
You have been picking up assembly, so you understand registers, doing math with registers, storing things in registers, etc. The program counter can be used by many instructions as just another register, the difference is when you write something to it, you change what instruction is executed next.
Lets try a case statement example:
switch(bob&3)
{
case 0: ted(); break;
case 1: joe(); break;
case 2: jim(); bob=2; break;
case 3: tim(); bob=7; break;
}
What you COULD (probably would not) do is:
casetable:
.word a
.word b
.word c
.word d
caseentry:
ldr r1,=bob
ldr r0,[r1]
ldr r2,=casetable
and r0,#3
ldr pc,[r2,r0,lsl #2]
a:
bl ted
b caseend
b:
bl joe
b caseend
c:
bl jim
mov r0,#2
ldr r1,=bob
str r0,[r1]
b caseend
d:
bl tim
mov r0,#7
ldr r1,=bob
str r0,[r1]
b caseend
caseend:
So the four words after the label casetable: are the addresses where the code starts for each of the cases, case0 starts at a: case1 code starts at b: and so on. What we need to do is take the variable used by the switch statement and mathematically compute an address for the item in the table. Then we need to load the address from the table into the program counter. Writing to the program counter is the same as performing a jump.
So the C sample was crafted intentially to make this easy. First load the contents of the bob variable into r0. And it with 3. The items in the jump table are 32 bit addresses, or 4 bytes so we need to multiply r0 times 4 to get the offset in the table. A shift left of 2 is the same as a multiply by 4. And we need to add r0<<2 to the base address for the jump table. So essentially we are computing address_of(casetable)+((bob&3)<<2) The read memory at that computed address and load that value into the program counter.
With arm (you mentioned this was arm) you can do much of this in one instruction:
ldr pc,[r2,r0,lsl #2]
Load into the register pc, the contents of the memory location [r2+(r0<<2)]. r2 is the address of casetable, and r0 is bob&3.
Basically a jump table boils down to mathmatically computing an offset into a table of addresses. The table of addresses are addresses you want to jump/branch to depending on one of the parameters used in the math operation, in my example above bob is that variable. And the addresses a,b,c,d are the address choices I want to pick from based on the contents of bob. There are a zillion fun and interesting ways to do this sort of thing, but it all boils down to computing at runtime the address to branch to, and shoving that address into the program counter in a way that causes the particular processor to perform what is essentially a jump.
Note another, perhaps easier to read way to compute and jump in my example would be:
mov r3,r0,lsl #2
add r3,r2
bx r3
The cores that support thumb use the bx instruction with a register often, normally you see bx lr to return from a branch link (subroutine) call. bx lr means pc = lr. bx r3 means pc = r3.
I hope this is what you were asking about, if I have misunderstood the question, please elaborate.
EDIT:
Looking at the code on your screen shot.
cmp r0,#4
addls pc,pc,r0,lsl #2
The optional math (ADDLS add if lower or same) computes the new program counter value (a jump table is a computation stored in the program counter) based on the program counter itself plus an offset r0 times 4. For arm processors, at the time of execution, the program counter is two instructions ahead. so, mixing those two lines of code and a portion of my example:
cmp r0,#4
addls pc,pc,r0,lsl #2
ldr pc,=a
ldr pc,=b
ldr pc,=c
ldr pc,=d
...
At the time addls is executed the program counter contains the address for the ldr pc,=b instruction. So if r0 contains a 0 then 0<<2 = 0, pc plus 0 would branch to the ldr pc,=b instruction then that instruction causes a branch to the b: label. if r0 contained a 1 at the time of addls then you would execute the ldr pc,=c instruction next and so on. You can make a table as deep as you want this way. Also note that since the add is conditional, if the condition does not happen you will execute that first instruction after the addls, so maybe you want that to be an unconditional branch to branch over the table, or branch backward an loop or maybe it is a nop so that you fall into the first jump, or what I did above is have it branch to some other place. So to understand what is going on you need to example the instructions that follow the addls to figure out what the possible jump table destinations are.

Related

What would be a reason to use ADDS instruction instead of the ADD instruction in ARM assembly?

My course notes always use ADDS and SUBS in their ARM code snippets, instead of the ADD and SUB as I would expect. Here's one such snippet for example:
__asm void my_capitalize(char *str)
{
cap_loop
LDRB r1, [r0] // Load byte into r1 from memory pointed to by r0 (str pointer)
CMP r1, #'a'-1 // compare it with the character before 'a'
BLS cap_skip // If byte is lower or same, then skip this byte
CMP r1, #'z' // Compare it with the 'z' character
BHI cap_skip // If it is higher, then skip this byte
SUBS r1,#32 // Else subtract out difference to capitalize it
STRB r1, [r0] // Store the capitalized byte back in memory
cap_skip
ADDS r0, r0, #1 // Increment str pointer
CMP r1, #0 // Was the byte 0?
BNE cap_loop // If not, repeat the loop
BX lr // Else return from subroutine
}
This simple code for example converts all lowercase English in a string to uppercase. What I do not understand in this code is why they are not using ADD and SUB commands instead of ADDS and SUBS currently being used. The ADDS and SUBS command, afaik, update the APSR flags NZCV, for later use. However, as you can see in the above snippet, the updated values are not being utilized. Is there any other utility of this command then?

Arithmetic instructions (ADD, SUB, etc) don't modify the status flag, unlike comparison instructions (CMP,TEQ) which update the condition flags by default. However, adding the S to the arithmetic instructions(ADDS, SUBS, etc) will update the condition flags according to the result of the operation. That is the only point of using the S for the arithmetic instructions, so if the cf are not going to be checked, there is no reason to use ADDS instead of ADD.
There are more codes to append to the instruction (link), in order to achieve different purposes, such as CC (the conditional flag C=0), hence:
ADDCC: do the operation if the carry status bit is set to 0.
ADDCCS: do the operation if the carry status bit is set to 0 and afterwards, update the status flags (if C=1, the status flags are not overwritten).
From the cycles point of view, there is no difference between updating the conditional flags or not. Considering an ARMv6-M as example, ADDS and ADD will take 1 cycle.
Discard the use of ADD might look like a lazy choice, since ADD is quite useful for some cases. Going further, consider these examples:
SUBS r0, r0, #1
ADDS r0, r0, #2
BNE go_wherever
and
SUBS r0, r0, #1
ADD r0, r0, #2
BNE go_wherever
may yield different behaviours.
As old_timer has pointed out, the UAL becomes quite relevant on this topic. Talking about the unified language, the preferred syntax is ADDS, instead of ADD (link). So the OP's code is absolutely fine (even recommended) if the purpose is to be assembled for Thumb and/or ARM (using UAL).

ADD without the flag update is not available on some cortex-ms. If you look at the arm documentation for the instruction set (always a good idea when doing assembly language) for general purpose use cases that is not available until a thumb2 extension on armv7-m (cortex-m3, cortex-m4, cortex-m7). The cortex-m0 and cortex-m0+ and generally wide compatibility code (which would use armv4t or armv6-m) doesn't have an add without flags option. So perhaps that is why.
The other reason may be to get the 16-bit instruction not the 32, but but that is a slippery slope as it gets even more into assemblers and their syntax (syntax is defined by the assembler, the program that processes assembly language, not the target). For example not syntax unified gas:
.thumb
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
The disassembler knows reality but the assembler doesn't:
so.s: Assembler messages:
so.s:2: Error: instruction not supported in Thumb16 mode -- `adds r1,r2,r3'
but
.syntax unified
.thumb
adds r1,r2,r3
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: eb02 0103 add.w r1, r2, r3
So not slippery in this case, but with the unified syntax you start to get into blahw, blah.w, blah, type syntax and have to spin back around to check to see that the instructions you wanted are being generated. Non-unified has its own games as well, and of course all of this is assembler-specific.
I suspect they were either going with the only choice they had, or were using the smaller and more compatible instruction, especially if this were a class or text, the more compatible the better.

Sorting ARM Assembly

I am newbie. I have difficulties with understanding memory ARM memory map.
I have found example of simple sorting algorithm
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortc
; r0 = &arr[0]
; r1 = length
__sortc
stmfd sp!, {r2-r9, lr}
mov r4, r1 ; inner loop counter
mov r3, r4
sub r1, r1, #1
mov r9, r1 ; outer loop counter
outer_loop
mov r5, r0
mov r4, r3
inner_loop
ldr r6, [r5], #4
ldr r7, [r5]
cmp r7, r6
; swap without swp
strls r6, [r5]
strls r7, [r5, #-4]
subs r4, r4, #1
bne inner_loop
subs r9, r9, #1
bne outer_loop
ldmfd sp!, {r2-r9, pc}^
END
And this assembly should be called this way from C code
#define MAX_ELEMENTS 10
extern void __sortc(int *, int);
int main()
{
int arr[MAX_ELEMENTS] = {5, 4, 1, 3, 2, 12, 55, 64, 77, 10};
__sortc(arr, MAX_ELEMENTS);
return 0;
}
As far as I understand this code creates array of integers on the stack and calls _sortc function which implemented in assembly. This function takes this values from the stack and sorts them and put back on the stack. Am I right ?
I wonder how can I implement this example using only assembly.
For example defining array of integers
DCD 3, 7, 2, 8, 5, 7, 2, 6
BTW Where DCD declared variables are stored in the memory ??
How can I operate with values declared in this way ? Please explain how can I implement this using assembly only without any C code, even without stack, just with raw data.
I am writing for ARM7TDMI architecture

AREA ARM, CODE, READONLY - this marks start of section for code in the source.
With similar AREA myData, DATA, READWRITE you can start section where it's possible to define data like data1 DCD 1,2,3, this will compile as three words with values 1, 2, 3 in consecutive bytes, with label data1 pointing to the first byte of first word. (some AREA docs from google).
Where these will land in physical memory after loading executable depends on how the executable is linked (linker is using a script file which is helping him to decide which AREA to put where, and how to create symbol table for dynamic relocation done by the executable loader, by editing the linker script you can adjust where the code and data land, but normally you don't need to do that).
Also the linker script and assembler directives can affect size of available stack, and where it is mapped in physical memory.
So for your particular platform: google for memory mappings on web and check the linker script (for start just use linker option to produce .map file to see where the code and data are targeted to land).
So you can either declare that array in some data area, then to work with it, you load symbol data1 into register ("load address of data1"), and use that to fetch memory content from that address.
Or you can first put all the numbers into the stack (which is set probably to something reasonable by the OS loader of your executable), and operate in the code with the stack pointer to access the numbers in it.
You can even DCD some values into CODE area, so those words will end between the instructions in memory mapped as read-only by executable loader. You can read those data, but writing to them will likely cause crash. And of course you shouldn't execute them as instructions by accident (forgetting to put some ret/jump instruction ahead of DCD).
without stack
Well, this one is tricky, you have to be careful to not use any call/etc. and to have interrupts disabled, etc.. basically any thing what needs stack.
When people code a bootloader, usually they set up some temporary stack ASAP in first few instructions, so they can use basic stack functionality before setting up whole environment properly, or loading OS. A space for that temporary stack is often reserved somewhere in/after the code, or an unused memory space according to defined machine state after reset.
If you are down to the metal, without OS, usually all memory is writeable after reset, so you can then intermix code and data as you wish (just jumping around the data, not executing them by accident), without using AREA definitions.
But you should make your mind, whether you are creating application in user space of some OS (so you have things like stack and data areas well defined and you can use them for your convenience), or you are creating boot loader code which has to set it all up for itself (more difficult, so I would suggest at first going into user land of some OS, having C wrapper around with clib initialized is often handy too, so you can call things like printf from ASM for convenient output).
How can I operate with values declared in this way
It doesn't matter in machine code, which way the values were declared. All that matters is, if you have address of the memory, and if you know the structure, how the data are stored there. Then you can work with them in any way you want, using any instruction you want. So body of that asm example will not change, if you allocate the data in ASM, you will just pass the pointer as argument to it, like the C does.
edit: some example done blindly without testing, may need further syntax fixing to work for OP (or maybe there's even some bug and it will not work at all, let me know in comments if it did):
AREA myData, DATA, READWRITE
SortArray
DCD 5, 4, 1, 3, 2, 12, 55, 64, 77, 10
SortArrayEnd
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortasmarray
__sortasmarray
; if "add r0, pc, #SortArray" fails (code too far in memory from array)
; then this looks like some heavy weight way of loading any address
; ldr r0, =SortArray
; ldr r1, =SortArrayEnd
add r0, pc, #SortArray ; address of array
; calculate array size from address of end
; (as I couldn't find now example of thing like "equ $-SortArray")
add r1, pc, #SortArrayEnd
sub r1, r1, r0
mov r1, r1, lsr #2
; do a direct jump instead of "bl", so __sortc returning
; to lr will actually return to called of this
b __sortc
; ... rest of your __sortc assembly without change
You can call it from C code as:
extern void __sortasmarray();
int main()
{
__sortasmarray();
return 0;
}
I used among others this Introducing ARM assembly language to refresh my ARM asm memory, but I'm still worried this may not work as is.
As you can see, I didn't change any thing in the __sortc. Because there's no difference in accessing stack memory, or "dcd" memory, it's the same computer memory. Once you have the address to particular word, you can ldr/str it's value with that address. The __sortc receives address of first word in array to sort in both cases, from there on it's just memory for it, without any context how that memory was defined in source, allocated, initialized, etc. As long as it's writeable, it's fine for __sortc.
So the only "dcd" related thing from me is loading array address, and the quick search for ARM examples shows it may be done in several ways, this add rX, pc, #label way is optimal, but does work only for +-4k range? There's also pseudo instruction ADR rX, #label doing this same thing, and maybe switching to other in case of range problem? For any range it looks like ldr rX, = label form is used, although I'm not sure if it's pseudo instruction or how it works, check some tutorials and disassembly the machine code to see how it was compiled.
It's up to you to learn all the ARM assembly peculiarities and how to load addresses of arrays, I don't need ARM ASM at the moment, so I didn't dig into those details.
And there should be some equ way to define length of array, instead of calculating it in code from end address, but I couldn't find any example, and I'm not going to read full Assembler docs to learn about all it's directives (in gas I think ArrayLength equ ((.-SortArray)/4) would work).

Cycles per instruction in delay loop on arm

I'm trying to understand some assembler generated for the stm32f103 chipset by arm-none-eabi-gcc, which seems to be running exactly half the speed I expect. I'm not that familiar with assembler but since everyone always says read the asm if you want to understand what your compiler is doing I am seeing how far I get. Its a simple function:
void delay(volatile uint32_t num) {
volatile uint32_t index = 0;
for(index = (6000 * num); index != 0; index--) {}
}
The clock speed is 72MHz and the above function gives me a 1ms delay, but I expect 0.5ms (since (6000*6)/72000000 = 0.0005).
The assembler is this:
delay:
# args = 0, pretend = 0, frame = 16
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
sub sp, sp, #16 stack pointer = stack pointer - 16
movs r3, #0 move 0 into r3 and update condition flags
str r0, [sp, #4] store r0 at location stack pointer+4
str r3, [sp, #12] store r3 at location stack pointer+12
ldr r3, [sp, #4] load r3 with data at location stack pointer+4
movw r2, #6000 move 6000 into r2 (make r2 6000)
mul r3, r2, r3 r3 = r2 * r3
str r3, [sp, #12] store r3 at stack pointer+12
ldr r3, [sp, #12] load r3 with data at stack pointer+12
cbz r3, .L1 Compare and Branch on Zero
.L4:
ldr r3, [sp, #12] 2 load r3 with data at location stack pointer+12
subs r3, r3, #1 1 subtract 1 from r3 with 'set APSR flag' if any conditions met
str r3, [sp, #12] 2 store r3 at location sp+12
ldr r3, [sp, #12] 2 load r3 with data at location sp+12
cmp r3, #0 1 status = 0 - r3 (if r3 is 0, set status flag)
bne .L4 1 branch to .L4 if not equal
.L1:
add sp, sp, #16 add 16 back to the stack pointer
# sp needed
bx lr
.size delay, .-delay
.align 2
.global blink
.thumb
.thumb_func
.type blink, %function
I've commented what I believe each instruction means from looking it up. So I believe the .L4 section is the loop of the delay function, which is 6 instructions long. I do realise that clock cycles are not always the same as instructions but since theres such a large difference, and since this is a loop which I imagine is predicted and pipelined efficiently, I am wondering if theres a solid reason that I am seeing 2 clock cycles per instruction.
Background:
In the project I am working on I need to use 5 output pins to control a linear ccd, and the timing requirements are said to be fairly tight. Absolute frequency will not be maxed out (I will clock the pins slower than the cpu is capable of) but pin timings relative to each other are important. So rather than use interupts which are at the limit of my ability and might complicate relative timings I am thinking use loops to provide the short delays (around 100 ns) between pin voltage change events, or even code the whole section in unrolled assembler since I have plenty of program storage space. There is a period when the pins are not changing during which I can run the ADC to sample the signal.
Although the odd behaviour I am asking about is not a show stopper I would rather understand it before proceeding.
Edit: From comment, the arm tech ref gives instruction timings. I have added them to the assembly. But its still only a total of 9 cycles rather than the 12 I expect. Is the jump a cycle itself?
TIA, Pete
Think I have to give this one to ElderBug although Dwelch raised some points which might also be very relevant so thanks to all. Going from this I will try using unrolled assembly to toggle the pins which are 20ns apart in their changes and then return back to C for the longer waits, and ADC conversion, then back to assembly to repeat the process, keeping an eye on the assembly output from gcc to get a rough idea of whether my timings look OK. BTW Elder the modified wait_cycles function does work as expected as you said. Thanks again.

First, doing a spin-wait loop in C is a bad idea. Here I can see that you compiled with -O0 (no optimizations), and your wait will be much shorter if you enable optimizations (EDIT: Actually maybe the unoptimized code you posted just results from the volatile, but it doesn't really matter). C wait loops are not reliable. I maintained a program that relied on a function like that, and each time we had to change a compiler flag, the timings were messed (fortunately, there was a buzzer that went out of tune as a result, reminding us to change the wait loop).
About why you don't see 1 instruction per cycle, it is because some instructions don't take 1 cycle. For example, bne can take additional cycles if the branch is taken. The problem is that you can have less deterministic factors, like bus usage. Accessing the RAM means using the bus, that can be busy fetching data from ROM or in use by a DMA. This means instructions like STR and LDR may be delayed. On your example, you have a STR followed by a LDR on the same location (typical of -O0); if the MCU doesn't have store-to-load forwarding, you can have a delay.
What I do for timings is using a hardware timer for delay above 1µs, and a hard-coded assembly loop for the really short delays.
For the hardware timer, you just have to setup a timer at a fixed frequency (with period < 1µs if you want delay accurate at 1µs), and use some simple code like that :
void wait_us( uint32_t us ) {
uint32_t mark = GET_TIMER();
us *= TIMER_FREQ/1000000;
while( us > GET_TIMER() - mark );
}
You can even use mark as a parameter to set it before some task, and use the function to wait for the remaining time after. Example :
uint32_t mark = GET_TIMER();
some_task();
wait_us( mark, 200 );
For the assembly wait, I use this one for ARM Cortex-M4 (close to yours) :
#define CYCLES_PER_LOOP 3
inline void wait_cycles( uint32_t n ) {
uint32_t l = n/CYCLES_PER_LOOP;
asm volatile( "0:" "SUBS %[count], 1;" "BNE 0b;" :[count]"+r"(l) );
}
This is very short, precise, and won't be affected by compiler flags nor bus load.
You may have to tune the CYCLES_PER_LOOP, but I think it will the same value for your MCU (here it is 1+2 for SUBS+BNE).

this is a cortex-m3 so you are likely running out of flash? did you try running from ram and/or adjust the flash speed, or adjust the clocks vs flash speed (slow the main clock) so you can get the flash to as close to a single cycle per access as you can.
you are also doing a memory access for half of those instructions which is a cycle or more for the fetch (one if you are on sram running on the same clock) and another clock for the ram access (due to using volatile). so that could account for some percentage of the difference between one clock per and two clocks per, the branch might cost more than one clock as well, on an m3 not sure if you can turn that on or off (branch prediction) and branch prediction is a bit funny the way it works anyway, if it is too close to the beginning of a fetch block then it wont work, so where the branch is in ram can affect the performance, where any of this is in ram can affect the performance, you can do experiments by adding nops anywhere in front of the code to change the alignment of the loop, affects caches (which you likely dont have here) and can also affect other things based on how big and where the instructions lie in a fetch. (some arms fetch 8 instructions at a time for example).
not only do you need to know assembly to understand what you are trying to do but how to manipulate that assembly and other things like alignment, re-arranging the instruction mix, sometimes more instructions is faster than fewer and so on. pipelines and caches are difficult at best to predict if at all, and can easily throw off assumptions and experiments with hand optimized code.
even if you overcome the slow flash, lack of a cache (although you cannot rely on its performance), and other things, the logic between the core and the I/O and the speed of the I/O for bit banging might be another performance hit, no reason to expect the I/O to be a small number of cycles per access, it might even be double digit number of clocks. very early in this research you need to start gpio read only loops, write only loops, and read/write loops. If you are relying on the gpio logic to only touch one bit in a port rather than the whole port that might have a cycle cost so you need to performance tune that as well.
you might want to look into using a cpld if you are even close to the margin on timing and have to be hard real time, as one extra line of code or a new rev of the compiler can completely throw off the timing of the project.

ARMv7 assembly store values in an array

I'm trying to add a user inputted number to a every element in an array. I had everything working until I realized that the original array was not being updated. Simple, I thought, just store the value back into the array and move on with life. Sadly, this doesn't seem to be quite so simple.
As the title suggests, I'm using ARMv7 and writing assembly. I've been using this guide to understand the basics and to have some good code to look at. When I run the example code given here it works fine: str r2, [r3] puts whatever is in r2 into what r3 points at. The following is my attempt to do the same thing which gives me a Signal 11 occurred: SIGSEGV (Invalid memory segment access) and Execution stopped at: 0x0000580C STR r3,[r5,#0]:
# Loop and add value to all values in array regardless of array length
# Setup loop
# r4 comes from above and the scanf value, I've checked the registers and the value is correct
mov r0, #0
ldr r1, =array_b
ldr r2, addrArr
loop: # Start loop to add inputed number to every value in array
add r3, r2, r0
ldr r3, [r3]
add r3, r3, r4 # Add input to each index in array
add r5, r2, r0 # Pointer to location in array
str r3, [r5] # Put new value into array
cmp r0, r1 # Check for end of array
addne r0, r0, #4 # Not super necessary but it shows one of the cool things ARM can do, condition math
bne loop # Branch if not equal
beq doneLoop # Branch if equal
doneLoop: # End loop
Here are the vars
.align 2
array:
.word 0
.word 1
.word 2
.word 3
.word 4
.word 5
.word 6
.word 7
.equ array_b, .-array
addrArr: .word array
My understanding is that str takes the source first and the destination second (which is different from other instructions for some reason). So r5 is used to calculate where in the array to store the value and r3 has the value from the add instruction. I've checked and the value in r5 is valid, ie: it's the start of the array and the array_b is the proper length (32 in this case). I've also tried doing =array instead of addrArr but they give the same value and the same segfault message.

This is because there is historically in systems two major kind of memories :
ROM, Read Only Memory, cannot be written to and can store only the program and constant data
RAM, Random Access Memory, can be both read and written to. It is used to store variables.
Many systems do no use ROM directly, instead the data can be loaded from an other permanent support, for example a floppy disc, a tape or a hard disc into RAM. In order to avoid a program writing to RAM memory that wasn't supposed to be written, the RAM can be divided in multiple areas, using segmented memory.
Not all system features this, so it really depends on the architecture. If segmented memory is used, it basically makes the processor quit the application when you try to write to a segment of RAM that is designed to be read only. This is exactly what appears to be your problem here.
In order to solve this you should declare your array, which is a variable and should be stocked in RAM, by precedding it by .data.
On the other hand your executable instructions should be placed in the read only segment marked with the assembler directive .text

How to properly create an array in ARM assembly?

I'm currently learning ARM assembly for a class and have come across a problem where I'd need to use an "array." I'm aware that there is no such thing as an array in ARM so I have to allocate space and treat that as an array. I have two questions.
Am I correctly adding new values to the array or am I merely overwriting the previous value? If I am overwriting the values, how do I go about adding new values?
How do I go about looping through the different values of the array? I know I have to use loop: but don't know how to use it to access different "indexes."
So far, this is what I've gotten from reading ARM documentation as I've gathered from resources online.
.equ SWI_Exit, 0x11
.text
.global _start
_start: .global _start
.global main
b main
main:
ldr R0, =MyArray
mov R1, #42
str R1, [R0], #4
mov R1, #43
str R1, [R0], #4
swi SWI_Exit
MyArray: .skip 20 * 4
.end
As a side note, I am using ARMSim# as required by my professor, so some commands recognized by GNU tools won't be recognized by ARMSim#, or at least I believe that is the case. Please correct me if I'm wrong.

You are just overwriting elements. At this level there are "such things as arrays", but only fixed-sized, preallocated arrays. The .skip is allocating the fixed-size array.* A variable-sized, growable array would typically be implemented with more complex dyanamic memory allocation code using the stack or a heap.
If you had a label like loop: (the actual name is arbitrary) you could branch (back) to it by using b loop. (Probably, you would want to do the branch conditionally so that the program didn't loop forever.) You can access different elements in the loop by changing R0, which you are already doing
Also the b main isn't really serving any purpose since it is branching to he next instruction. The code will do the same thing if you remove it.
[*] Alternately, you could say that your array is the just elements between MyArray and R0 (not including the memory R0 points to), in which, by changing R0 you are extending the array. But the maximum size Is still fixed by the .skip directive.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight