Memory copying: ARM STM vs. ARM NEON

I need to copy large amounts of memory (on the order of 47k), for example from a USB buffer to a more permanent buffer.
This is on an ARM Cortex-A8, which includes the NEON unit.
A NEON instruction can copy 4 32-bit elements at a time (per instruction).
The ARM LDM and STM instructions can load and store (copy) more than 4 registers at a time (per instruction).
Questions:
Which is more efficient for copying large amounts (e.g. 47k) of memory, the ARM NEON instruction or the ARM LDM and STM instructions? (I don't have benchmarking tools available; this is on an embedded system).
What is the advantage of the ARM NEON instructions for copying memory?
The project is primarily C language, but also has some assembly language.
Is there a method to suggest to the compiler to use the ARM NEON or the LDM/STM instructions without enabling optimizations? (We ship code without optimizations so there are no differences when a product is returned; there is a possibility that optimization could itself be responsible for issues in the product.)
Tools:
ARM Cortex A8 processor
IAR Embedded Workbench IDE & compiler.
Development on a Windows 10 PC, targeting a remote embedded ARM processor (via JTAG).

NEON has the advantage of unaligned loads and stores, but it consumes more power.
And since you are copying from the USB buffer to a permanent one, where you have full control over alignment and size, it would be better without NEON, because the memory speed is the same either way.
The standard memcpy most probably already utilizes NEON (it depends on the BSP), hence I'd write a mini version utilizing ldrd and strd, which is slightly faster than ldm and stm.
@ r0 = destination, r1 = source, r2 = byte count (a multiple of 32)
.balign 64
push {r4-r11}
sub r1, r1, #8              @ bias both pointers so the loop can use
sub r0, r0, #8              @ pre-indexed offsets of 8, 16, 24, 32
b 1f
.balign 64
1:
ldrd r4, r5, [r1, #8]
ldrd r6, r7, [r1, #16]
ldrd r8, r9, [r1, #24]
ldrd r10, r11, [r1, #32]!   @ writeback: r1 advances by 32
subs r2, r2, #32
strd r4, r5, [r0, #8]
strd r6, r7, [r0, #16]
strd r8, r9, [r0, #24]
strd r10, r11, [r0, #32]!   @ writeback: r0 advances by 32
bgt 1b
.balign 16
pop {r4-r11}
bx lr
I think you will have no problem making the buffer size a multiple of 32, with both buffers aligned to 64 bytes (the cache line length) or, even better, 4096 bytes (the page size).
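For completeness, here is a minimal sketch of how the routine above might be wired into a C project; treat it as an assumption-laden illustration. The function name copy32 is mine (the assembly above is unlabeled), and the __attribute__((aligned)) syntax is the GCC-style form, so adapt it to whatever your toolchain expects (IAR has its own alignment keywords).
/* copy32.c - hypothetical C wrapper for the ldrd/strd loop above.
   Assumes the assembly is labeled copy32 and follows the AAPCS:
   r0 = dst, r1 = src, r2 = byte count (must be a multiple of 32). */
#include <stddef.h>
#include <stdint.h>
extern void copy32(void *dst, const void *src, size_t n);
/* Both buffers cache-line aligned; 47 * 1024 is a multiple of 32. */
static uint8_t usb_buf[47 * 1024] __attribute__((aligned(64)));
static uint8_t perm_buf[47 * 1024] __attribute__((aligned(64)));
void drain_usb(void)
{
    copy32(perm_buf, usb_buf, sizeof usb_buf);
}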

How to get qemu to run an arm thumb binary?

I'm trying to learn the basics of ARM assembly and wrote a fairly simple program to sort an array. I initially assembled it using the armv8-a option and ran the program under qemu while debugging with gdb. This worked fine and the program initialized the array and sorted it as expected.
Ultimately I would like to be able to write some assembly for my Raspberry Pi Pico, which has an ARM Cortex M0+, which I believe uses the armv6-m option. However, when I change the directive in my code, it compiles fine but behaves strangely in that the program counter increments by 4 after every instruction instead of the 2 that I expect for thumb. This is causing my program to not work correctly. I suspect that qemu is trying to run my code as if it were compiled for the full ARM instruction set instead of thumb, but I'm not sure why this is.
I am running on Ubuntu Linux 20.04 LTS, using qemu-arm version 4.2.1 (installed from the package manager). Does the qemu-arm executable only run full ARM binaries? If so, is there another qemu package I can install to run a thumb binary?
Here is my code if it is helpful:
.arch armv6-m
.cpu cortex-m0plus
.syntax unified
.thumb
.data
arr: .skip 4 * 10
len: .word 10
.section .text
.global _start
.align 2
_start:
ldr r0, arr_adr # load the address of the start of the array into register 0
movs r1, #0 # clear the counter register
movs r2, #100
init_loop:
str r2, [r0,r1] # store r2's value to the base address of the array plus the offset stored in r1
subs r2, r2, #10 # subtract 10 from r2
adds r1, #4 # add 4 to the offset (1 word in bytes)
cmp r1, #40 # check if we've reached the end of the array
bne init_loop
movs r1, #0 # clear the offset
out_loop:
mov r3, r1 # set the index of the minimum value to the current array index
mov r4, r1 # set the inner loop index to the outer loop index
in_loop:
ldr r5, [r0,r3] # load the minimum index's value to r5
ldr r6, [r0,r4] # load the inner loop's next value to r6
cmp r6, r5 # compare the two values
bge in_loop_inc # if r6 is greater than or equal to r5, increment and restart loop
mov r3, r4 # set the minimum index to the current index
in_loop_inc:
adds r4, #4
cmp r4, #40 # check if at end of array
blt in_loop
ldr r5, [r0,r3] # load the minimum index value into r5
ldr r6, [r0,r1] # load the current outer loop index value into r6
str r6, [r0,r3] # swap the two values
str r5, [r0,r1]
adds r1, #4 # increment outer loop index
cmp r1, #40 # check if at end of array
blt out_loop
loop:
nop
b loop
arr_adr: .word arr
Thank you for your help!
There are a couple of concepts to disentangle here:
(1) Arm vs Thumb : these are two different instruction sets. Most CPUs support both, some support only one. Both are available simultaneously if the CPU supports both. To simplify a little bit, if you jump to an address with the least significant bit set that means "go to Thumb mode", and jumping to an address with that bit clear means "go to Arm mode". (Interworking is a touch more complicated than that, but that's a good initial mental model.) Note that all Arm instructions are 4 bytes long, but Thumb instructions can be either 2 or 4 bytes long.
(2) A-profile vs M-profile : these are two different families of CPU architecture. M-profile is "microcontrollers"; A-profile is "applications processors", which is "(almost) everything else". M-profile CPUs always support Thumb and only Thumb code. A-profile CPUs support both Arm and Thumb. The Raspberry Pi Pico is a Cortex-M0+, which is M-profile.
(3) QEMU system emulation vs user-mode emulation : these are two different QEMU executables which run guest code in different ways. The system emulation binary (typically qemu-system-arm) runs "bare metal code", eg an entire OS. The guest code has full control and can handle exceptions, write to hardware devices, etc. The user emulation binary (typically qemu-arm) is for running Linux user-space binaries. Guest code is started in unprivileged mode and has access to the usual Linux system calls. For system emulation, which CPU is being emulated depends on what machine type you select with the -M or --machine option. For user-mode emulation, the default CPU is "A-profile with all supported features enabled" (this is --cpu max).
You're currently using qemu-arm which means you get user-mode emulation. This should support Thumb binaries, but unless you pass it a --cpu option it will be using an A-profile CPU. I would also suggest using a newer QEMU for M-profile work, because a lot of M-profile CPU bugs have been fixed since version 4.2. I think 4.2 is also too old to have the Cortex-M0 CPU.
GDB should tell you in the PSR what the T bit is set to -- use that to check whether you're in Thumb mode or Arm mode, rather than looking at how much the PC is incrementing by.
There's currently no QEMU system emulation of the Raspberry Pi Pico (though somebody has been doing some experimental work on one). If your assembly is just basic "working with registers and a bit of memory" you can do that with the user-mode emulator. Or you can try the 'microbit' machine model, which is a Cortex-M0 board -- if you're not doing things that are specific to the Pi Pico that might be good enough.
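As a concrete example (hedged: the set of --cpu names available depends on your QEMU version, so check the output of qemu-arm --cpu help first, and the binary name here is just a placeholder):
qemu-arm --cpu cortex-m3 ./sort
With an M-profile CPU selected, the user-mode emulator executes the binary as Thumb-only code, and the T bit in the PSR shown by GDB should read as set.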
memmap
MEMORY
{
ram : ORIGIN = 0x00000000, LENGTH = 32K
}
SECTIONS
{
.text : { *(.text*) } > ram
}
strap.s
.cpu cortex-m0
.thumb
.syntax unified
.globl reset_entry
reset_entry:
.word 0x20001000
.word reset
.word hang
.word hang
.word hang
.thumb_func
reset:
ldr r0,=0x40002500
ldr r1,=4
str r1,[r0]
ldr r0,=0x40002008
ldr r1,=1
str r1,[r0]
ldr r0,=0x4000251C
ldr r1,=0x30
ldr r2,=0x37
loop_top:
str r1,[r0]
adds r1,r1,#1
ands r1,r1,r2
b loop_top
.thumb_func
hang:
b hang
build
arm-linux-gnueabi-as --warn --fatal-warnings strap.s -o strap.o
arm-linux-gnueabi-ld strap.o -T memmap -o notmain.elf
arm-linux-gnueabi-objdump -D notmain.elf > notmain.list
Check the vector table as a quick sanity check:
Disassembly of section .text:
00000000 <reset_entry>:
0: 20001000 andcs r1, r0, r0
4: 00000015 andeq r0, r0, r5, lsl r0
8: 0000002f andeq r0, r0, pc, lsr #32
c: 0000002f andeq r0, r0, pc, lsr #32
10: 0000002f andeq r0, r0, pc, lsr #32
00000014 <reset>:
14: 4806 ldr r0, [pc, #24] ; (30 <hang+0x2>)
16: 4907 ldr r1, [pc, #28] ; (34 <hang+0x6>)
18: 6001 str r1, [r0, #0]
1a: 4807 ldr r0, [pc, #28] ; (38
Looks good; run it:
qemu-system-arm -M microbit -nographic -kernel notmain.elf
and it will spew out 0123456701234567... until you press Ctrl-A then X to exit QEMU.
Note this binary will not work on a real chip as I am cheating the uart.
You can get your feet wet with this sim. There is also a Luminary Micro machine model (from the first Cortex-M chips), and you can limit yourself to armv6-m instructions on that platform as well.
QEMU and sims like this have very limited value for MCU work, since almost all of the work is related to peripherals and pins. The instruction set is just the language of a book: French, Russian, English, German, it doesn't matter; a biology book is a biology book, and the book is the goal. The peripherals are specific to the chip (the Pico, a specific STM32 chip, a specific TI Tiva C chip, etc.).

ARM Cortex-M3 boot from RAM initial state

I have two ARM Cortex-M3 chips: STM32F103C8T6 and STM32F103VET6.
When set to boot from RAM, the initial state of the STM32F103C8T6 PC register is 0x20000108; it is 0x200001e0 for the STM32F103VET6.
I am unable to find any information about these addresses in the datasheets. Why are they booted this way, and where can I find some information about it?
Edit:
To clarify: when the chip is set to boot from flash, the PC register points to the location of the reset handler. This address is provided in the reset vector table at address 0x0. But when the chip is set to boot from RAM, the PC points to the constant addresses mentioned above.
Edit 2:
STM32F103C8T6 disassembly:
20000000 <Vectors>:
20000000: 20005000 andcs r5, r0, r0
20000004: 2000010f andcs r0, r0, pc, lsl #2
20000008: 2000010d andcs r0, r0, sp, lsl #2
2000000c: 2000010d andcs r0, r0, sp, lsl #2
20000010: 2000010d andcs r0, r0, sp, lsl #2
20000014: 2000010d andcs r0, r0, sp, lsl #2
20000018: 2000010d andcs r0, r0, sp, lsl #2
...
20000108: f000 b801 b.w 2000010e <Reset_Handler>
2000010c <HardFault_Handler>:
2000010c: e7fe b.n 2000010c <HardFault_Handler>
2000010e <Reset_Handler>:
...
STM32F103VET6 disassembly:
20000000 <Vectors>:
20000000: 20005000 andcs r5, r0, r0
20000004: 200001e7 andcs r0, r0, r7, ror #3
20000008: 200001e5 andcs r0, r0, r5, ror #3
2000000c: 200001e5 andcs r0, r0, r5, ror #3
20000010: 200001e5 andcs r0, r0, r5, ror #3
20000014: 200001e5 andcs r0, r0, r5, ror #3
20000018: 200001e5 andcs r0, r0, r5, ror #3
...
200001e0: f000 b801 b.w 200001e6 <Reset_Handler>
200001e4 <HardFault_Handler>:
200001e4: e7fe b.n 200001e4 <HardFault_Handler>
200001e6 <Reset_Handler>:
...
I am unable to find any information about these addresses in the datasheets. Why are they booted this way and where can I find some information about it?
As far as I am aware, there is no official documentation from ST that so much as mentions this behavior, let alone explains it in any detail. The STM32F1 family reference manual states vaguely in section 3.4 ("Boot Configuration") that:
Due to its fixed memory map, the code area starts from address 0x0000 0000 (accessed through the ICode/DCode buses) while the data area (SRAM) starts from address 0x2000 0000 (accessed through the system bus). The Cortex®-M3 CPU always fetches the reset vector on the ICode bus, which implies to have the boot space available only in the code area (typically, Flash memory). STM32F10xxx microcontrollers implement a special mechanism to be able to boot also from SRAM and not only from main Flash memory and System memory.
The only place these addresses and values are referenced at all is in some of their template startup files -- and even then, not all of them. The SPL startup files supplied for the ARM and IAR toolchains lack support for BootRAM; this functionality is only included in the startup files for the GCC and TrueSTUDIO toolchains.
Anyways. Here's my best analysis of the situation.
When a STM32F1 part is reset, the memory block starting at 0x00000000 is mapped based on the configuration of the BOOT pins. When it is set to boot from flash, that block is aliased to flash; when it is set to run from the bootloader, that block is aliased to a block of internal ROM (around or slightly below 0x1FFFF000). However, when it is set to boot from RAM, something very strange happens.
Instead of aliasing that memory block to SRAM, as you would expect, that memory block is aliased to a tiny (16 byte!) ROM. On a STM32F103C8 (medium density) part, this ROM has the contents:
20005000 20000109 20000004 20000004
This data is interpreted as a vector table:
The first word causes the stack pointer to be initialized to 0x20005000, which is at the top of RAM.
The second word is the reset vector, and is set to 0x20000108 (with the low bit set to enable Thumb mode). This address is in RAM as well, a few words beyond the end of the vector table, and it's where you're supposed to put the "magic" value 0xF108F85F. This is actually the instruction ldr.w pc, [pc, #-264], which loads the real reset vector from the RAM vector table (the word at 0x20000004) and branches to it.
The third and fourth words are the NMI and hardfault vectors. They do not have the low bit set, so the processor will double-fault if either of these exceptions occurs while VTOR is still zero. Confusingly, the PC will be left pointing to the vector table in RAM.
The exact contents of this ROM vary slightly from part to part. For example, a F107 (connectivity line) has the ROM contents:
20005000 200001e1 20000004 20000004
which has the same initial SP, but a different initial PC. This is because this part has a larger vector table, and the medium-density address would be inside its vector table.
A full list of the locations and values used is:
Low/medium density: 0x0108 (value: 0xF108F85F)
Low/medium density value line: 0x01CC (value: 0xF1CCF85F)
Note: ST's sample files give the same value as for low/medium density parts. I'm pretty sure this is wrong and have corrected it here, but I don't have any parts of this type to test with. I'd appreciate feedback to confirm if this works.
All others: 0x01E0 (value: 0xF1E0F85F)
Thankfully, this behavior seems to be largely unique to the F103/5/7 family. Newer parts use different methods to control booting which are much more consistent.
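To double-check this analysis, the magic words can be decoded mechanically. Here is a throwaway host-side C sketch (written for this answer, not any ST or ARM API) that interprets a magic word as the Thumb-2 instruction LDR.W PC, [PC, #-imm12] and computes the address it loads from:
#include <stdint.h>
#include <stdio.h>
/* addr = where the magic word sits in RAM; magic = the word itself. */
static void decode_magic(uint32_t addr, uint32_t magic)
{
    uint16_t lo = (uint16_t)(magic & 0xFFFFu);  /* first halfword  */
    uint16_t hi = (uint16_t)(magic >> 16);      /* second halfword */
    /* 0xF85F = LDR.W <Rt>, [PC, #-imm12]; Rt and imm12 are in hi. */
    if (lo == 0xF85F && (hi >> 12) == 0xFu) {   /* Rt == PC        */
        uint32_t imm12 = hi & 0xFFFu;
        uint32_t pc = (addr + 4) & ~3u;         /* Align(PC, 4)    */
        printf("%08X: ldr.w pc, [pc, #-%u] -> loads word at %08X\n",
               (unsigned)magic, (unsigned)imm12, (unsigned)(pc - imm12));
    }
}
int main(void)
{
    decode_magic(0x20000108u, 0xF108F85Fu); /* low/medium density */
    decode_magic(0x200001E0u, 0xF1E0F85Fu); /* all others         */
    return 0;
}
Both variants decode to a load from 0x20000004, which is exactly the reset vector of the RAM vector table shown in the disassemblies above.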

How does a compiler/assembler make sense of processor core registers?

My question is specific to arm cortex M3 micro-controllers. Every peripheral on the micro controller is memory mapped and those memory addresses are used in processing.
For example: GPIOA->ODR = 0;
This will write a 0 at address 0x4001080C.
This address is defined in the device specific file of the micro controller.
Now, the Cortex-M3 has processor core registers R0-R12 (general purpose). I want to know: do these registers also have addresses, like the other peripherals?
So, if I have the instruction: MOV R0, #10;
will R0 be translated to some address when assembled? Do core registers have special numeric addresses, exclusive to core peripherals? Is the address of R0 defined in any file (I couldn't find any), like that of GPIOA? Or are register R0 and the other core registers referred to by their names only, so that the assembler sees "R0" and generates the opcode from it?
I have this confusion because some 8 bit controllers also have addresses for general purpose registers.
Thanks,
Navin
Registers like R0-R12, SP, PC, etc. are registers inside the CPU core, and they are not mapped into the global address space. Access to these registers is possible only from assembler.
Direct access to the core registers from higher-level languages like C is likewise not possible, because they are not addressable. These registers are used for internal processing and are transparent to the programmer.
But registers like GPIOA->ODR are mapped into the global address space, so each such register has its own address.
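The difference is easy to see from C. A memory-mapped peripheral register is reachable through a pointer to its documented address (0x4001080C is the address quoted in the question), while no address exists that designates R0:
#include <stdint.h>
/* GPIOA output data register, memory-mapped at 0x4001080C. */
#define GPIOA_ODR (*(volatile uint32_t *)0x4001080Cu)
void clear_odr(void)
{
    GPIOA_ODR = 0; /* compiles to a store to that address; no      */
                   /* comparable C expression can name R0 directly */
}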
General-purpose registers are meant for general-purpose operations within the CPU, much like the few temporary variables we use in any programming language. Relating this to your question: the CPU requires a few reserved storage locations for its basic operations, so there is no point in exposing them to the outside world. This is how ARM-based processors work.
You happened to have picked an instruction that is really easy to see...
.thumb
mov r0,#10
mov r1,#10
mov r2,#10
mov r3,#10
mov r4,#10
mov r5,#10
mov r6,#10
mov r7,#10
assemble then disassemble to see the machine code
Disassembly of section .text:
00000000 <.text>:
0: 200a movs r0, #10
2: 210a movs r1, #10
4: 220a movs r2, #10
6: 230a movs r3, #10
8: 240a movs r4, #10
a: 250a movs r5, #10
c: 260a movs r6, #10
e: 270a movs r7, #10
There will be three or four bits, depending on the instruction and the instruction set (ARM vs. Thumb, and then the Thumb2 extensions), that specify the register. In this case those bits happen to line up nicely with the hex representation of the machine code, so we can see the 0 through 7. For a Cortex-M3, many of the Thumb instructions are limited to r0-r7 (implying a 3-bit register field within the instruction), with one or two instructions to move between the lower and upper registers; the Thumb2 extensions allow more access to the full r0-r15 (and thus carry a 4-bit field in the instruction). You should get the ARMv7-M Architecture Reference Manual, which is the one associated with the Cortex-M3 (after you get the Cortex-M3 Technical Reference Manual and see that it uses the ARMv7-M architecture). You can also get the oldest ARMv5 Architecture Reference Manual, as it has the oldest description of the Thumb instruction set, the one instruction set that is compatible across all ARM cores. ARMv6-M covers the Cortex-M0, which has a lot fewer Thumb2 extensions than ARMv7-M, which covers the Cortex-M3; the M4 and M7 have tons more Thumb2 extensions.
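As a sanity check on where those bits live, here is a tiny throwaway sketch (written for this answer) that rebuilds the 16-bit Thumb encoding of MOVS Rd, #imm8, whose T1 form is 00100 Rd(3) imm8(8):
#include <stdint.h>
#include <stdio.h>
/* MOVS Rd, #imm8, Thumb encoding T1: 0x2000 | Rd << 8 | imm8. */
static uint16_t encode_movs_imm8(unsigned rd, unsigned imm8)
{
    return (uint16_t)(0x2000u | ((rd & 7u) << 8) | (imm8 & 0xFFu));
}
int main(void)
{
    for (unsigned rd = 0; rd < 8; rd++)
        printf("%04x  movs r%u, #10\n", encode_movs_imm8(rd, 10), rd);
    return 0;
}
This prints 200a, 210a, ... 270a: exactly the machine code in the disassembly above, with the register number sitting in bits 8-10.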
another example that takes only a second to try
.thumb
mov r0,r0
mov r1,r1
mov r2,r2
mov r3,r3
mov r4,r4
mov r5,r5
mov r6,r6
mov r7,r7
mov r0,r0
mov r1,r0
mov r2,r0
mov r3,r0
mov r4,r0
mov r5,r0
mov r6,r0
mov r7,r0
Disassembly of section .text:
00000000 <.text>:
0: 1c00 adds r0, r0, #0
2: 1c09 adds r1, r1, #0
4: 1c12 adds r2, r2, #0
6: 1c1b adds r3, r3, #0
8: 1c24 adds r4, r4, #0
a: 1c2d adds r5, r5, #0
c: 1c36 adds r6, r6, #0
e: 1c3f adds r7, r7, #0
10: 1c00 adds r0, r0, #0
12: 1c01 adds r1, r0, #0
14: 1c02 adds r2, r0, #0
16: 1c03 adds r3, r0, #0
18: 1c04 adds r4, r0, #0
1a: 1c05 adds r5, r0, #0
1c: 1c06 adds r6, r0, #0
1e: 1c07 adds r7, r0, #0
Note that the bits didn't line up as nicely with the hex values as before; that doesn't matter, look at the binary to see the three bits changing from instruction to instruction.
In this case the assembler chose to use an ADD instead of a MOV.
Notes:
Encoding: This instruction is encoded as ADD Rd, Rn, #0.
and
Notes
Operand restriction: If a low register is specified for <Rd> and <Rm> (H1==0 and H2==0), the result is UNPREDICTABLE.
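The ADD Rd, Rn, #0 encoding quoted in those notes can be checked the same way; in the 16-bit T1 form the base opcode is 0x1C00 with Rn in bits 3-5 and Rd in bits 0-2 (again a throwaway sketch, not a real assembler):
/* ADDS Rd, Rn, #0 - the encoding the assembler emitted for "mov". */
static uint16_t encode_adds_imm0(unsigned rd, unsigned rn)
{
    return (uint16_t)(0x1C00u | ((rn & 7u) << 3) | (rd & 7u));
}
For example, encode_adds_imm0(1, 0) yields 0x1C01, matching the "adds r1, r0, #0" line in the listing above.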
All this, plus a zillion more things, you learn when you read the documentation: http://infocenter.arm.com. On the left, ARM architecture, then reference manuals; you may have to sacrifice an email address. You can also google "arm architectural reference manual" and you may get lucky...

What does the ORRr instruction do in ARM?

I am not able to understand the instruction below, used in cache initialization.
inv_loop2
LSL r3, r1, r6
LSL r8, r2, r7
ORRr 3, r3, r8
How does ORRr work here? I know about the ORR instruction, but ORRr is quite confusing.
This is a typo; the space is misplaced, and the line was meant to read ORR r3, r3, r8.
You are looking at an old version of the documentation for enabling caches on ARM processors (M-7?). The updated documentation is here, and does not contain the same typo:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0646b/BABGJGCH.html
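For reference, once the typo is fixed the three-instruction sequence is just packing two shifted fields into one register; in C terms (treating the register names as plain variables for illustration):
/* Equivalent of: LSL r3, r1, r6 ; LSL r8, r2, r7 ; ORR r3, r3, r8 */
uint32_t r3 = (r1 << r6) | (r2 << r7);
In a cache invalidate-by-set/way loop like this, the two fields are typically the set and way indices being combined into the operand for the cache maintenance operation.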

Cortex m3 first instruction execution

I am using Sourcery CodeBench Lite 2012.03-56 compiler and gdb suite with texane gdb server.
Today I wanted to try a FreeRTOS demo example for the cheap STM32VLDISCOVERY board. I copied all the source files needed and it compiled without errors, but the example didn't work. I fired up the debugger and noticed that the example fails when it tries to dereference a pointer to the GPIO registers. The global array variable which contains pointers to the GPIO registers:
GPIO_TypeDef* GPIO_PORT[LEDn] = {LED3_GPIO_PORT, LED4_GPIO_PORT};
was not properly initialized and was filled with some random values. I checked the preprocessor defines LED3_GPIO_PORT and LED4_GPIO_PORT and they were valid.
After some research into where the problem might be, I looked at the start-up file provided for TrueSTUDIO found in the CMSIS lib. Original startup_stm32f10x_md_vl.S file:
.section .text.Reset_Handler
.weak Reset_Handler
.type Reset_Handler, %function
Reset_Handler:
/* Copy the data segment initializers from flash to SRAM */
movs r1, #0
b LoopCopyDataInit
CopyDataInit:
ldr r3, =_sidata
ldr r3, [r3, r1]
str r3, [r0, r1]
adds r1, r1, #4
LoopCopyDataInit:
ldr r0, =_sdata
ldr r3, =_edata
adds r2, r0, r1
cmp r2, r3
bcc CopyDataInit
ldr r2, =_sbss
b LoopFillZerobss
...
During debugging I noticed that register r1 is never initialized to zero by the first instruction, movs r1, #0. Register r1 is used as a counter in the loop, so when execution reaches LoopCopyDataInit it never enters the loop, since r1 is holding garbage data from the previous execution. As a result, the startup code never initializes the .data section.
When I placed two nop instructions before the movs r1, #0 instruction, register r1 was initialized to 0 and the example began to work:
Modified part of startup_stm32f10x_md_vl.S file:
/* Copy the data segment initializers from flash to SRAM */
nop
nop
movs r1, #0
b LoopCopyDataInit
This is disassembly of relevant parts of final code:
Disassembly of section .isr_vector:
08000000 <g_pfnVectors>:
8000000: 20002000 andcs r2, r0, r0
8000004: 08000961 stmdaeq r0, {r0, r5, r6, r8, fp}
...
Disassembly of section .text:
...
8000960 <Reset_Handler>:
8000960: 2100 movs r1, #0
8000962: f000 b804 b.w 800096e <LoopCopyDataInit>
08000966 <CopyDataInit>:
8000966: 4b0d ldr r3, [pc, #52] ; (800099c <LoopFillZerobss+0x16>)
8000968: 585b ldr r3, [r3, r1]
800096a: 5043 str r3, [r0, r1]
800096c: 3104 adds r1, #4
As you can see, the ISR vector table properly points to the Reset_Handler address. So, what is happening? Why was the first instruction, movs r1, #0, never executed in the original start-up code?
EDIT:
The original code works when I power the board off and back on again. I can reset the MCU multiple times and it works. When I start the gdb server, the code doesn't work, even after a reset; I have to power cycle the board again to make it work. I guess there is some debugger weirdness going on.
NOTE:
I had a look at what start-up code other people are using with this MCU: they either disable interrupts or load the SP register with a linker-defined value, which is in both cases redundant. If they were hit by this odd behaviour, they would never notice it.
Sounds like a bug in your debugger. Probably it sets a breakpoint on the first instruction and either skips it completely or somehow fails when re-executing it. The issue could be complicated by the fact that it's a reset vector; maybe it's just not possible to reliably stop at the first instruction. Since the NOPs help, I'd recommend leaving them in place while you're developing your program.
However, there is an alternative solution. Since it's unlikely that you'll need to modify the array, you don't really need it in a writable section. To have the compiler put the array in flash, it's usually enough to declare it as const:
GPIO_TypeDef* const GPIO_PORT[LEDn] = {LED3_GPIO_PORT, LED4_GPIO_PORT};
Nothing jumps out right away as to what could be wrong there. First, how are you debugging this code? Are you attaching a debugger and then issuing a reset to the processor via JTAG? I would try putting b Reset_Handler right after your Reset_Handler: label as your first instruction, flash it on, turn on the board, and then connect the JTAG, just to minimize any possible weirdness from the debugger. Then set your PC to that mov instruction and see if it works. Is a bootloader or boot ROM launching this code? It could be something weird going on with the instruction or data cache.
