Assembling armv8-a neon with gnu assembler

I am trying to assemble AArch64 NEON instructions with the GNU assembler. The example is from the NEON programming quick reference:
.text
.align 4
.global add_float_neon2
.type add_float_neon2, %function
add_float_neon2:
.L_loop:
ld1 {v0.4s}, [x1], #16
ld1 {v1.4s}, [x2], #16
fadd v0.4s, v0.4s, v1.4s
subs x3, x3, #4
st1 {v0.4s}, [x0], #16
bgt .L_loop
ret
When running the gnu assembler I get the following error:
arm-linux-gnueabi-as -march=armv8-a -mfpu=neon test.s
test.s: Assembler messages:
test.s:13: Error: bad instruction `ld1 {v0.4s},[x1],#16'
test.s:15: Error: bad instruction `ld1 {v1.4s},[x2],#16'
test.s:17: Error: bad instruction `fadd v0.4s,v0.4s,v1.4s'
test.s:19: Error: ARM register expected -- `subs x3,x3,#4'
test.s:21: Error: bad instruction `st1 {v0.4s},[x0],#16'
test.s:25: Error: bad instruction `ret'
It cannot assemble any instructions with the <Vd>.<T> operand formats, even if I try running it with a single instruction. What am I doing wrong?

Related

Call C function from Assembly, passing args and getting the return value in the ARM calling convention

I want to call a C function, say:
int foo(int a, int b) {return 2;}
inside an assembly (ARM) code. I read that I need to mention
import foo
in my assembly code so that the assembler looks for foo in the C file. But I am stuck on passing the arguments a and b from assembly and getting the integer result (here 2) back in assembly. Could someone explain how to do this, with a mini example?

You have already written the minimal example.
int foo(int a, int b) {return 2;}
compile and disassemble
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -d so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <foo>:
0: e3a00002 mov r0, #2
4: e12fff1e bx lr
Everything to do with a and b is dead code, so it is optimized out. Using C to learn asm is a fine way to get started, but you really want to do it with optimizations on, which means you have to work harder at crafting the experimental code.
int foo(int a, int b) {return 2;}
int bar ( void )
{
return(foo(5,4));
}
and we learn nothing new.
Disassembly of section .text:
00000000 <foo>:
0: e3a00002 mov r0, #2
4: e12fff1e bx lr
00000008 <bar>:
8: e3a00002 mov r0, #2
c: e12fff1e bx lr
We need to hide the body of foo from the compiler to see the call:
int foo(int a, int b);
int bar ( void )
{
return(foo(5,4));
}
and now we see
00000000 <bar>:
0: e92d4010 push {r4, lr}
4: e3a01004 mov r1, #4
8: e3a00005 mov r0, #5
c: ebfffffe bl 0 <foo>
10: e8bd4010 pop {r4, lr}
14: e12fff1e bx lr
(Yes, this is built for this compiler's default target, armv4t. That should be obvious to some; others will have no clue how I/we know. You can also tell how new or old the compiler is from this example, as there was an ABI change years ago that is visible here. The newer versions of gcc are worse than older ones, so older is still good to use for some use cases.)
This is per this compiler's convention. While this compiler does use the ARM convention from some version of some document, always remember that it is the compiler authors' choice; they are under no obligation to conform to anyone's written standard.
So we see that the first parameter goes in r0 and the second in r1. You can craft functions with more operands, or more types of operands, to see what nuances there are: how many go in registers, and when the compiler starts using the stack instead. For example, try a 64-bit variable then a 32-bit one, in that order, as operands, then try it in reverse.
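One way to run that experiment is a small probe file like the sketch below. The function names are made up for illustration; the point is only to compile it with arm-none-eabi-gcc -O2 -c and read the disassembly, looking at which registers (or stack slots) each argument arrives in. Under the ARM procedure call standard I would expect a 64-bit argument to occupy an even/odd register pair and a fifth 32-bit argument to land on the stack, but treat that as something to confirm with your own compiler rather than a given.

```c
#include <stdint.h>

/* Hypothetical probe functions, not from any library. Each one uses every
   argument so the optimizer cannot discard them. */

/* 64-bit operand first, then a 32-bit one. */
uint64_t wide_then_narrow(uint64_t a, uint32_t b)
{
    return a + b;
}

/* Same operands in the reverse order. */
uint64_t narrow_then_wide(uint32_t a, uint64_t b)
{
    return a + b;
}

/* More 32-bit operands than there are argument registers in r0-r3. */
uint32_t five_words(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    return a + (b << 1) + (c << 2) + (d << 3) + (e << 4);
}
```

Giving each argument a different weight in five_words makes it easy to tell the operands apart in the generated code.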
To see what is going on on the callee side.
int foo(int a, int b)
{
return((a<<1)+b+0x123);
}
We see that r0 and r1 are the first two operands, the compiler would be grossly broken otherwise.
00000000 <foo>:
0: e0810080 add r0, r1, r0, lsl #1
4: e2800e12 add r0, r0, #288 ; 0x120
8: e2800003 add r0, r0, #3
c: e12fff1e bx lr
What we did not see explicitly in the caller example is that r0 is where the return is stored (at least for this variable type).
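The two add instructions in that listing come from how ARM-mode data-processing immediates are encoded: an 8-bit value rotated right by an even amount. 0x123 does not fit that form, so the compiler split it into 0x120 + 3. A small host-side check (the helper names are mine, purely for illustration) makes this easy to experiment with:

```c
#include <stdint.h>
#include <stdbool.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rol32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* True if value can be written as an 8-bit constant rotated right by an
   even amount, which is the ARM-mode (A32) immediate form. */
bool is_arm_immediate(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by rot; what is left must fit in 8 bits. */
        if (rol32(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

With this, is_arm_immediate(0x123) is false while is_arm_immediate(0x120) and is_arm_immediate(3) are both true, which matches the pair of adds above. Thumb-2 uses a different immediate scheme, so this check applies to ARM mode only.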
The ABI documentation is not an easy read, but if you first "just try it", then referring to the documentation afterwards should be easier. At the end of the day you have a compiler you are going to use. It has a convention and is probably part of a toolchain, so you must conform to that compiler's convention, not some third-party document (even if that third party is ARM), AND you should probably use that toolchain's assembler, which means you should use that assembly language (there are many incompatible assembly languages for ARM; the tool defines the language, not the target).
You can see how simple it is to figure this out on your own.
This gets painful, but you can also look at the assembly output of the compiler; at least some compilers will let you. With gcc you can use -save-temps or -S:
int foo(int a, int b)
{
return 2;
}
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global foo
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type foo, %function
foo:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
mov r0, #2
bx lr
.size foo, .-foo
.ident "GCC: (15:9-2019-q4-0ubuntu1) 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]"
You "need" almost none of this.
The minimum looks like this
.globl foo
foo:
mov r0,#2
bx lr
.global and .globl are equivalent; which one you use somewhat reflects your age or how/when you learned GNU assembler.
Now this will break if you are mixing ARM and Thumb instructions; this defaults to ARM.
arm-none-eabi-as x.s -o x.o
arm-none-eabi-objdump -d x.o
x.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <foo>:
0: e3a00002 mov r0, #2
4: e12fff1e bx lr
If we want thumb then we have to tell it
.thumb
.globl foo
foo:
mov r0,#2
bx lr
and we get thumb.
00000000 <foo>:
0: 2002 movs r0, #2
2: 4770 bx lr
With ARM and with the gnu toolchain at least you can mix arm and thumb and the linker will take care of the transition
int foo ( int, int );
int fun ( void )
{
return(foo(1,2));
}
we do not need a bootstrap nor other things to get the linker to link so we can see how that part of it works.
arm-none-eabi-ld so.o x.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -d so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <fun>:
8000: e92d4010 push {r4, lr}
8004: e3a01002 mov r1, #2
8008: e3a00001 mov r0, #1
800c: eb000001 bl 8018 <foo>
8010: e8bd4010 pop {r4, lr}
8014: e12fff1e bx lr
00008018 <foo>:
8018: 2002 movs r0, #2
801a: 4770 bx lr
Now this is broken, not just because we have no bootstrap etc., but because there is a bl to foo, and foo is Thumb while the caller is ARM. For GNU assembler for ARM you can take this shortcut, which I think I learned from an older gcc:
.thumb
.thumb_func
.globl foo
foo:
mov r0,#2
bx lr
.thumb_func says the next label you find is considered a function label not just an address.
00008000 <fun>:
8000: e92d4010 push {r4, lr}
8004: e3a01002 mov r1, #2
8008: e3a00001 mov r0, #1
800c: eb000003 bl 8020 <__foo_from_arm>
8010: e8bd4010 pop {r4, lr}
8014: e12fff1e bx lr
00008018 <foo>:
8018: 2002 movs r0, #2
801a: 4770 bx lr
801c: 0000 movs r0, r0
...
00008020 <__foo_from_arm>:
8020: e59fc000 ldr ip, [pc] ; 8028 <__foo_from_arm+0x8>
8024: e12fff1c bx ip
8028: 00008019 .word 0x00008019
802c: 00000000 .word 0x00000000
The linker adds a trampoline, as I call it; I think others call it a veneer. Either way the toolchain took care of it, so long as we write the code right.
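The addresses in that listing can be checked by hand. The sketch below is host-side C with made-up helper names; it assumes the standard ARM-mode encodings, where a branch target is the instruction address plus 8 plus a sign-extended 24-bit word offset, and where bit 0 of an interworking address selects Thumb.

```c
#include <stdint.h>
#include <stdbool.h>

/* Target of an ARM-mode B/BL instruction located at insn_addr. */
uint32_t arm_branch_target(uint32_t insn_addr, uint32_t insn)
{
    uint32_t offset = (insn & 0x00FFFFFFu) << 2;   /* imm24 counts words */
    if (offset & 0x02000000u)                      /* sign bit of the 26-bit offset */
        offset |= 0xFC000000u;
    return insn_addr + 8u + offset;                /* PC reads as insn_addr + 8 */
}

/* Address read by an ARM-mode "ldr rX, [pc, #imm]" located at insn_addr. */
uint32_t arm_pc_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    return insn_addr + 8u + imm;
}

/* Bit 0 of an interworking address selects Thumb state. */
bool is_thumb_target(uint32_t addr)
{
    return (addr & 1u) != 0;
}

/* The address the code actually lives at, with the state bit removed. */
uint32_t code_address(uint32_t addr)
{
    return addr & ~1u;
}
```

With the values from the listing, arm_branch_target(0x800c, 0xeb000003) gives 0x8020 (the veneer), arm_pc_literal_addr(0x8020, 0) gives 0x8028, and the word stored there, 0x00008019, has bit 0 set, so bx ip enters foo at 0x8018 in Thumb state.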
Remember that this syntax is very much assembler-specific; other assemblers may have other syntax to make this work. From the gcc-generated code we see the generic solution, which is more typing but probably a better habit.
.thumb
.type foo, %function
.global foo
foo:
mov r0,#2
bx lr
The .type foo, %function directive works for both ARM and Thumb in GNU assembler for ARM. It does not have to be positioned just before the label (and neither does .globl or .global). We get the same result from the toolchain with this assembly language.
Just for demonstration...
arm-none-eabi-as x.s -o x.o
arm-none-eabi-gcc -O2 -mthumb -c so.c -o so.o
arm-none-eabi-ld so.o x.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -d so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <fun>:
8000: b510 push {r4, lr}
8002: 2102 movs r1, #2
8004: 2001 movs r0, #1
8006: f000 f807 bl 8018 <__foo_from_thumb>
800a: bc10 pop {r4}
800c: bc02 pop {r1}
800e: 4708 bx r1
00008010 <foo>:
8010: e3a00002 mov r0, #2
8014: e12fff1e bx lr
00008018 <__foo_from_thumb>:
8018: 4778 bx pc
801a: e7fd b.n 8018 <__foo_from_thumb>
801c: eafffffb b 8010 <foo>
And you can see it works both ways, Thumb to ARM and ARM to Thumb: if we write the asm right, the toolchain does the rest of the work for us.
Now I personally hate the unified syntax; it is one of the major mistakes ARM has made, along with CMSIS. But if you want to do this for a living, you find that you pretty much hate most corporate decisions and, worse, have to work with them. Often unified syntax generates the wrong instruction, and if I need a specific instruction I have to fiddle with the syntax to get the assembler to generate it. Other than a bootstrap and some other exceptions, you do not often write assembly language anyway; usually you compile something, then take the compiler-generated code and tune it or replace it.
I started with the arm gnu tools before unified syntax so I am used to
.thumb
.globl hello
hello:
sub r0,#1
bne hello
instead of
.thumb
.globl hello
hello:
subs r0,#1
bne hello
And fine with bouncing between the two syntaxes (unified and not, yes two assembly languages within one tool).
All of the above is for 32-bit ARM. If you are interested in 64-bit ARM, AND using GNU tools, then a percentage of this still applies; you just need to use the aarch64 tools, not the arm tools, from GNU. ARM's aarch64 is a completely different, and incompatible, instruction set from aarch32. But GNU syntax like .global and .type...function is often used across all GNU-supported targets. There are exceptions for some directives, but if you take the same approach of having the tools themselves tell you how they work, by using them, you can figure this out.
so.elf: file format elf64-littleaarch64
Disassembly of section .text:
0000000000400000 <fun>:
400000: 52800041 mov w1, #0x2 // #2
400004: 52800020 mov w0, #0x1 // #1
400008: 14000001 b 40000c <foo>
000000000040000c <foo>:
40000c: 52800040 mov w0, #0x2 // #2
400010: d65f03c0 ret

What you need to do is place the arguments in the correct registers (or on the stack) as required. The details of how to do this are known as the calling convention, which forms a very important part of the Application Binary Interface (ABI).
Details on the ARM (Armv7) calling convention can be found at: https://developer.arm.com/documentation/den0013/d/Application-Binary-Interfaces/Procedure-Call-Standard

How to Generate Exceptions on Cortex M3?

I am trying to generate exceptions like Bus Fault and Usage Fault on an ARM Cortex-M3. My code to enable exceptions:
void EnableExceptions(void)
{
UINT32 uReg = SCB->SHCSR;
uReg |= 0x00070000;
SCB->SHCSR = uReg;
//Set Configurable Fault Status Register
SCB->CFSR = 0x0367E7C3;
//Set to 1 DIV_0_TRP register
SCB->CCR |= 0x00000010;
//Set priorities of fault handlers
NVIC_SetPriority(MemoryManagement_IRQn, 0x01);
NVIC_SetPriority(BusFault_IRQn, 0x01);
NVIC_SetPriority(UsageFault_IRQn, 0x01);
}
void UsageFault_Handler(void){
//handle
//I've set a breakpoint but system does not hit
}
void BusFault_Handler(void){
//handle
//I've set a breakpoint but system does not hit
}
I tried to generate a division-by-zero exception and saw the variable's value as "Infinity". However, the system does not generate any exception and keeps running. I also tried to generate a Bus Fault exception and the same thing happens.
Also, when I comment out the EnableExceptions call the system works correctly. What is wrong with my code? Does ARM handle these kinds of errors inside the microprocessor?

Cortex-M devices use the Thumb-2 instruction set exclusively. ARM uses the least significant bit of the branch/jump/call address to determine whether the target is Thumb or ARM code. Since the Cortex-M cannot run ARM code, you can generate a fault by creating a jump to an even address. (The architecture reports this as an INVSTATE UsageFault, which escalates to HardFault if the UsageFault handler is not enabled.)
int dummy(){ volatile int x = 0 ; return x ; }
int main()
{
typedef void (*fn_t)();
fn_t foo = (fn_t)(((char*)dummy) - 1) ;
foo() ;
}
The following will also work, since the call will fail before any instructions are executed, so it does not need to point to any valid code.
int main()
{
typedef void (*fn_t)();
fn_t foo = (fn_t)(0x8004000) ;
foo() ;
}
You can generate a usage fault by forcing an integer divide by zero:
int main()
{
volatile int x = 0 ;
volatile int y = 1 / x ;
}

From your comment question: How to generate any exception.
Here is one from the documentation:
Encoding T1 All versions of the Thumb instruction set.
SVC<c> #<imm8>
...
Exceptions
SVCall.
Which I can find by searching for SVCall.
Exceptions are well documented by ARM. There are ways to cause the exceptions you listed without having to break the bus (which would require a sim, an FPGA, or creating your own silicon); you already know the search terms to find BusFault and UsageFault in the document.
How ARM handles these (internally or not) is documented. "Internally" in this case means a lockup; otherwise the processor executes the fault handler (unless of course there is a fault fetching the fault handler).
Most of them you can create in C without resorting to assembly language instructions, but you have to be careful that the compiler is generating what you think it is generating:
void fun ( void )
{
int x = 3;
int y = 0;
int z = x / y;
}
Disassembly of section .text:
00000000 <fun>:
0: 4770 bx lr
Instead you want something that actually generates the instruction that can cause the fault:
int fun0 ( int x, int y )
{
return(x/y);
}
void fun1 ( void )
{
fun0(3,0);
}
00000000 <fun0>:
0: fb90 f0f1 sdiv r0, r0, r1
4: 4770 bx lr
6: bf00 nop
00000008 <fun1>:
8: 4770 bx lr
But as shown, you have to be careful about where and how you call it. In this case the call was made in the same file, so the optimizer could see that this is dead code and optimized it out. A test like this would fail to generate a fault for multiple reasons.
This is why the OP needs to provide a complete minimal example: the reason why faults are not being seen is not the processor, but the software and/or test code.
Edit
A complete minimal example, everything you need but a gnu toolchain (no . This is on a stm32 blue pill an STM32F103...
flash.s
.cpu cortex-m3
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset /* 1 Reset */
.word hang /* 2 NMI */
.word hang /* 3 HardFault */
.word hang /* 4 MemManage */
.word hang /* 5 BusFault */
.word usagefault /* 6 UsageFault */
.word hang /* 7 Reserved */
.word hang /* 8 Reserved */
.word hang /* 9 Reserved */
.word hang /*10 Reserved */
.word hang /*11 SVCall */
.word hang /*12 DebugMonitor */
.word hang /*13 Reserved */
.word hang /*14 PendSV */
.word hang /*15 SysTick */
.word hang /* External interrupt 1 */
.word hang /* External interrupt 2 */
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
.thumb_func
.globl dummy
dummy:
bx lr
.thumb_func
.globl dosvc
dosvc:
svc 1
.thumb_func
.globl hop
hop:
bx r0
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
so.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
void hop ( unsigned int );
#define GPIOCBASE 0x40011000
#define RCCBASE 0x40021000
#define SHCSR 0xE000ED24
void usagefault ( void )
{
unsigned int ra;
while(1)
{
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<100000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<100000;ra++) dummy(ra);
}
}
int notmain ( void )
{
unsigned int ra;
ra=GET32(SHCSR);
ra|=1<<18; //usagefault
PUT32(SHCSR,ra);
ra=GET32(RCCBASE+0x18);
ra|=1<<4; //enable port c
PUT32(RCCBASE+0x18,ra);
ra=GET32(GPIOCBASE+0x04);
ra&=(~(3<<20)); //PC13
ra|= (1<<20) ; //PC13
ra&=(~(3<<22)); //PC13
ra|= (0<<22) ; //PC13
PUT32(GPIOCBASE+0x04,ra);
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<200000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<200000;ra++) dummy(ra);
ra=GET32(0x08000004);
ra&=(~1);
hop(ra);
return(0);
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c so.c -o so.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -O binary so.elf so.bin
Not all of those command line options are required. arm-linux-gnueabi- and other flavors of GNU toolchain work just fine, from several versions back to the present, because I use them only as a compiler, assembler and linker and don't mess with libraries or other things that vary from one flavor to another.
UsageFault The UsageFault fault handles non-memory related faults
caused by instruction execution.
A number of different situations cause usage faults, including:
• Undefined Instruction.
• Invalid state on instruction execution.
• Error on exception return.
• Attempting to access a disabled or unavailable coprocessor.
The following can cause usage faults when the processor is configured to
report them:
• A word or halfword memory accesses to an unaligned address.
• Division by zero.
Software can disable this fault. If it does, a UsageFault escalates to HardFault. UsageFault has a configurable priority.
...
Instruction execution with EPSR.T set to 0 causes the invalid state UsageFault
So the test here branches to an ARM address instead of a Thumb address, and this causes a UsageFault. (You can read up on the BX instruction, the PSR.T bit, how and when it gets changed, etc. in the documentation as well.)
Backing up: this is the STM32 blue pill. There is an LED on PC13. The code enables UsageFault, configures PC13 as an output, and blinks it once so we see the program started; if it then hits the UsageFault handler, it blinks forever.
ra&=(~1);
If you comment this out, then it keeps branching to reset, which does everything again: you see one slow blink, repeated forever.
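The PC13 setup in notmain follows the STM32F1 GPIO register layout: pins 8 to 15 are configured in the register at offset 0x04, four bits per pin (two MODE bits, then two CNF bits), and the register at offset 0x10 sets a pin through its low 16 bits and resets it through its high 16 bits. A host-side sketch of the same arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Return a new value for the high configuration register (pins 8..15)
   with the 4-bit field of `pin` replaced by the given MODE and CNF. */
uint32_t crh_configure_pin(uint32_t crh, unsigned pin, unsigned mode, unsigned cnf)
{
    unsigned shift = (pin - 8u) * 4u;              /* pin 13 -> bits 20..23 */
    crh &= ~(0xFu << shift);                       /* clear MODE and CNF */
    crh |= ((mode & 3u) | ((cnf & 3u) << 2)) << shift;
    return crh;
}

/* Value to write to the set/reset register to drive a pin high. */
uint32_t bsrr_set_mask(unsigned pin)
{
    return 1u << pin;
}

/* Value to write to the set/reset register to drive a pin low. */
uint32_t bsrr_reset_mask(unsigned pin)
{
    return 1u << (pin + 16u);
}
```

crh_configure_pin(ra, 13, 1, 0) performs the same update as the four ra&=/ra|= lines in notmain, and bsrr_set_mask(13) and bsrr_reset_mask(13) are the 1<<(13+0) and 1<<(13+16) values written to GPIOCBASE+0x10.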
Before running, naturally, you check the build to see that it has a chance of working:
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000
8000004: 08000049
8000008: 0800004f
800000c: 0800004f
8000010: 0800004f
8000014: 0800004f
8000018: 08000061
800001c: 0800004f
8000020: 0800004f
8000024: 0800004f
8000028: 0800004f
800002c: 0800004f
8000030: 0800004f
8000034: 0800004f
8000038: 0800004f
800003c: 0800004f
8000040: 0800004f
8000044: 0800004f
08000048 <reset>:
8000048: f000 f82a bl 80000a0 <notmain>
800004c: e7ff b.n 800004e <hang>
0800004e <hang>:
800004e: e7fe b.n 800004e <hang>
...
08000060 <usagefault>:
8000060: b570 push {r4, r5, r6, lr}
The vector table is correct: the right vectors point to the right places.
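That check can also be scripted. The sketch below is host-side C, with a made-up function name and the memory ranges taken from flash.ld above; it assumes only what the listing shows, namely that the first word is the initial stack pointer and every handler entry is a flash address with bit 0 set (Thumb).

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Memory map copied from flash.ld. */
#define ROM_ORIGIN 0x08000000u
#define ROM_LENGTH 0x1000u
#define RAM_ORIGIN 0x20000000u
#define RAM_LENGTH 0x1000u

/* Rough plausibility check of a Cortex-M vector table image. */
bool vector_table_looks_sane(const uint32_t *table, size_t count)
{
    if (count < 2)
        return false;
    /* Entry 0: initial stack pointer; it may sit one past the end of RAM. */
    if (table[0] < RAM_ORIGIN || table[0] > RAM_ORIGIN + RAM_LENGTH)
        return false;
    for (size_t i = 1; i < count; i++) {
        uint32_t entry = table[i];
        if ((entry & 1u) == 0)                     /* handler must be Thumb */
            return false;
        uint32_t addr = entry & ~1u;
        if (addr < ROM_ORIGIN || addr >= ROM_ORIGIN + ROM_LENGTH)
            return false;
    }
    return true;
}
```

Fed the words from the disassembly (0x20001000, 0x08000049, 0x0800004f, ..., 0x08000061) it returns true; an entry such as 0x08000048, with bit 0 clear, makes it return false.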
0xE000ED28 CFSR RW 0x00000000
The UFSR is the upper 16 bits of the CFSR
> halt
target halted due to debug-request, current mode: Handler UsageFault
xPSR: 0x81000006 pc: 0x0800008a msp: 0x20000fc0
> mdw 0xE000ED28
0xe000ed28: 00020000
And that bit is
INVSTATE, bit[1]
0 EPSR.T bit and EPSR.IT bits are valid for instruction execution.
1 Instruction executed with invalid EPSR.T or EPSR.IT field.
Now
Using the CCR, see Configuration and Control Register, CCR on page B3-604, software can enable or disable:
• Divide by zero faults, alignment faults and some features of processor operation.
• BusFaults at priority -1 and higher.
The reset value of the CCR is IMPLEMENTATION DEFINED, so it might be enabled for you or not. You likely have to look at the Cortex-M3 TRM, or just read the register:
> mdw 0xE000ED14
0xe000ed14: 00000000
So it is all zeros on mine.
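For reference, the two trap enables in the CCR can be expressed as masks. The sketch below takes a pointer so it can be exercised on a host; on the target the pointer would be the CCR address 0xE000ED14 quoted above. The macro and function names are mine. Bit 4 is the one the question calls DIV_0_TRP; bit 3 (the unaligned access trap) is from the ARMv7-M documentation.

```c
#include <stdint.h>

/* Trap enable bits in the Configuration and Control Register. */
#define CCR_UNALIGN_TRP (1u << 3)   /* trap unaligned word/halfword accesses */
#define CCR_DIV_0_TRP   (1u << 4)   /* trap integer divide by zero */

/* Read-modify-write the register behind ccr, setting only trap bits. */
void enable_usage_traps(volatile uint32_t *ccr, uint32_t traps)
{
    uint32_t value = *ccr;
    value |= traps & (CCR_UNALIGN_TRP | CCR_DIV_0_TRP);
    *ccr = value;
}
```

On the target this would be called as enable_usage_traps((volatile uint32_t *)0xE000ED14, CCR_DIV_0_TRP), which is the same effect as the GET32/PUT32 pair used in the code that follows.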
So add fun.c:
unsigned int fun ( unsigned int x, unsigned int y)
{
return(x/y);
}
Change so.c:
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
unsigned int fun ( unsigned int, unsigned int);
#define GPIOCBASE 0x40011000
#define RCCBASE 0x40021000
#define SHCSR 0xE000ED24
#define CCR 0xE000ED14
void usagefault ( void )
{
unsigned int ra;
while(1)
{
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<100000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<100000;ra++) dummy(ra);
}
}
int notmain ( void )
{
unsigned int ra;
ra=GET32(SHCSR);
ra|=1<<18; //usagefault
PUT32(SHCSR,ra);
ra=GET32(CCR);
ra|=1<<4; //div by zero
PUT32(CCR,ra);
ra=GET32(RCCBASE+0x18);
ra|=1<<4; //enable port c
PUT32(RCCBASE+0x18,ra);
ra=GET32(GPIOCBASE+0x04);
ra&=(~(3<<20)); //PC13
ra|= (1<<20) ; //PC13
ra&=(~(3<<22)); //PC13
ra|= (0<<22) ; //PC13
PUT32(GPIOCBASE+0x04,ra);
PUT32(GPIOCBASE+0x10,1<<(13+0));
for(ra=0;ra<200000;ra++) dummy(ra);
PUT32(GPIOCBASE+0x10,1<<(13+16));
for(ra=0;ra<200000;ra++) dummy(ra);
fun(3,0);
return(0);
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c so.c -o so.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c fun.c -o fun.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o so.o fun.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -O binary so.elf so.bin
Confirm there is actually a divide instruction that we are going to hit:
800011e: 2100 movs r1, #0
8000120: 2003 movs r0, #3
8000122: f000 f80f bl 8000144 <fun>
...
08000144 <fun>:
8000144: fbb0 f0f1 udiv r0, r0, r1
8000148: 4770 bx lr
Load and run, and the handler is called.
target halted due to debug-request, current mode: Handler UsageFault
xPSR: 0x81000006 pc: 0x08000086 msp: 0x20000fc0
> mdw 0xE000ED28
0xe000ed28: 02000000
and that indicates it was a divide by zero.
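The two register reads so far can be decoded mechanically. The sketch below is host-side C with made-up names; the bit positions are the UsageFault status bits from the ARMv7-M documentation, which occupy the upper half of the CFSR.

```c
#include <stdint.h>

/* UsageFault status bits, numbered within the 16-bit UFSR. */
enum usage_fault {
    UF_UNDEFINSTR = 1u << 0,   /* undefined instruction */
    UF_INVSTATE   = 1u << 1,   /* invalid EPSR.T or EPSR.IT */
    UF_INVPC      = 1u << 2,   /* bad exception return */
    UF_NOCP       = 1u << 3,   /* coprocessor not available */
    UF_UNALIGNED  = 1u << 8,   /* unaligned access trap */
    UF_DIVBYZERO  = 1u << 9    /* divide by zero trap */
};

/* Extract the UFSR field from a CFSR value read at 0xE000ED28. */
uint32_t ufsr_from_cfsr(uint32_t cfsr)
{
    return cfsr >> 16;
}

/* Name of the lowest-numbered UsageFault bit that is set. */
const char *usage_fault_name(uint32_t cfsr)
{
    uint32_t ufsr = ufsr_from_cfsr(cfsr);
    if (ufsr & UF_UNDEFINSTR) return "UNDEFINSTR";
    if (ufsr & UF_INVSTATE)   return "INVSTATE";
    if (ufsr & UF_INVPC)      return "INVPC";
    if (ufsr & UF_NOCP)       return "NOCP";
    if (ufsr & UF_UNALIGNED)  return "UNALIGNED";
    if (ufsr & UF_DIVBYZERO)  return "DIVBYZERO";
    return "none";
}
```

For 0x00020000 (the first test) this reports INVSTATE, and for 0x02000000 (this one) DIVBYZERO.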
So everything you needed to know/do really was in the documentation, one document.
99.999% of bare-metal programming is reading or doing experiments to validate what was read, almost none of the work is writing the final application, it is but a tiny part of the job.
Before you can get to bare-metal programming you have to have mastered the toolchain otherwise nothing will work. Mastering the toolchain can be done without any target hardware, using free tools so that is just a matter of sitting down and doing it.
If you are trying to do a floating-point divide by zero on a core that doesn't have hardware floating point, you need to look at the soft float library, for example libgcc:
ARM_FUNC_START divsf3
ARM_FUNC_ALIAS aeabi_fdiv divsf3
CFI_START_FUNCTION
@ Mask out exponents, trap any zero/denormal/INF/NAN.
mov ip, #0xff
ands r2, ip, r0, lsr #23
do_it ne, tt
COND(and,s,ne) r3, ip, r1, lsr #23
teqne r2, ip
teqne r3, ip
beq LSYM(Ldv_s)
LSYM(Ldv_x):
...
@ Division by 0x1p*: let''s shortcut a lot of code.
LSYM(Ldv_1):
and ip, ip, #0x80000000
orr r0, ip, r0, lsr #9
adds r2, r2, #127
do_it gt, tt
COND(rsb,s,gt) r3, r2, #255
and so on
which should have been visible in the disassembly. I don't off-hand see a forced exception (intentional undefined instruction, swi/svc or anything like that). This is only one possible library, and now that I think about it this looks like ARM, not Thumb, so you would have to go looking for the Thumb version; it is easier to just look at the disassembly.
Based on your comment, and reading the other question again, I assume that because it didn't raise an exception, the correct result of a divide by zero is a properly signed infinity. If you switch to a Cortex-M4 or M7 then you might be able to trigger a hardware exception, but read the documentation to find out.
Edit 2
void fun ( void )
{
int a = 3;
int b = 0;
volatile int c = a/b;
}
fun.c:6:18: warning: unused variable ‘c’ [-Wunused-variable]
6 | volatile int c = a/b;
| ^
08000140 <fun>:
8000140: deff udf #255 ; 0xff
8000142: bf00 nop
> halt
target halted due to debug-request, current mode: Handler UsageFault
xPSR: 0x01000006 pc: 0x08000076 msp: 0x20000fc0
> mdw 0xE000ED28
0xe000ed28: 00010000
and that bit means
The processor has attempted to execute an undefined instruction
So volatile failed to produce the desired result using gcc:
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 10.1.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
arm-linux-gnueabi-gcc --version
arm-linux-gnueabi-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
produces
Disassembly of section .text:
00000000 <fun>:
0: deff udf #255 ; 0xff
2: bf00 nop
as well (and you can godbolt your way through others).
Yes a fault was produced and it was the usage fault but this would have been yet another Stack Overflow question as to why I didn't get a divide by zero. Blindly using volatile to force a divide doesn't work.
Making all three volatile
void fun ( void )
{
volatile int a = 3;
volatile int b = 0;
volatile int c = a/b;
}
Disassembly of section .text:
00000000 <fun>:
0: 2203 movs r2, #3
2: 2300 movs r3, #0
4: b084 sub sp, #16
6: 9201 str r2, [sp, #4]
8: 9302 str r3, [sp, #8]
a: 9b01 ldr r3, [sp, #4]
c: 9a02 ldr r2, [sp, #8]
e: fb93 f3f2 sdiv r3, r3, r2
12: 9303 str r3, [sp, #12]
14: b004 add sp, #16
16: 4770 bx lr
will generate the desired fault.
and with no optimizations
00000000 <fun>:
0: b480 push {r7}
2: b085 sub sp, #20
4: af00 add r7, sp, #0
6: 2303 movs r3, #3
8: 60fb str r3, [r7, #12]
a: 2300 movs r3, #0
c: 60bb str r3, [r7, #8]
e: 68fa ldr r2, [r7, #12]
10: 68bb ldr r3, [r7, #8]
12: fb92 f3f3 sdiv r3, r2, r3
16: 607b str r3, [r7, #4]
18: bf00 nop
1a: 3714 adds r7, #20
1c: 46bd mov sp, r7
1e: bc80 pop {r7}
20: 4770 bx lr
will also generate the desired fault
So master the language first (read read read), then master the toolchain second (read read read), then bare-metal programming (read read read). It is all about reading, not about coding. As shown above, even with decades of experience at this level, you can't completely predict what the tools will generate; you have to just try it. Most important: just because you figured it out for one tool, one time, one day, on one machine, is no reason to get too broad in your assumptions. You have to try it and examine what the compiler produces, and repeat the process until you get the desired effect. If push comes to shove, just write a few lines of asm and be done with it.
You weren't seeing faults because you weren't generating any, or weren't trapping them, or both. The list of possible reasons why is long based on the code provided, but these examples, which you should have no problem porting to your platform, should demonstrate that your hardware works too. Then you can sort out why your software didn't by connecting the dots between code that does and code that doesn't. All I did was follow the documentation and examine the output of the compiler; once I had the minimum number of things enabled, the fault handler was called. Without those enabled, the usage fault was not triggered.

BusFault, HardFault, MemManage fault, UsageFault, SVCall and NMI are internal exceptions of ARM Cortex-M microprocessors.
It really depends on which processor you are using, but let's suppose you have a Cortex-M3:
• By default all faults are mapped to the HardFault handler unless you explicitly enable them to be mapped to their own handlers.
• Set the USGFAULTENA, BUSFAULTENA and MEMFAULTENA bits in the System Handler Control and State Register => each of those faults will then be mapped to its own handler.
To generate a fault you can try:
• Access an unmapped memory area => this generates a BusFault
• Execute an unrecognized instruction => UsageFault
• Explicitly trigger one of those faults by setting one of the USGFAULTACT, BUSFAULTACT or MEMFAULTACT bits in the System Handler Control and State Register => this will generate an exception for sure
For more details please refer to: https://developer.arm.com/documentation/dui0552/a/cortex-m3-peripherals/system-control-block/system-handler-control-and-state-register?lang=en
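As a sketch of the enable step: the three bits sit at positions 16 to 18 of the register, which matches the 0x00070000 mask in the question. The helper below takes a pointer so it can be tried on a host; on a Cortex-M3 the pointer would be the SHCSR address 0xE000ED24. The names are mine, not CMSIS.

```c
#include <stdint.h>

/* Fault handler enable bits in the System Handler Control and State Register. */
#define SHCSR_MEMFAULTENA (1u << 16)
#define SHCSR_BUSFAULTENA (1u << 17)
#define SHCSR_USGFAULTENA (1u << 18)

/* Route MemManage, BusFault and UsageFault to their own handlers
   instead of letting them escalate to HardFault. */
void enable_configurable_faults(volatile uint32_t *shcsr)
{
    uint32_t value = *shcsr;
    value |= SHCSR_MEMFAULTENA | SHCSR_BUSFAULTENA | SHCSR_USGFAULTENA;
    *shcsr = value;
}
```

The read-modify-write keeps whatever other bits were already set in the register.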

Run Hello World on ARM emulator

I'm trying to emulate an ARM CPU at the moment. I have read a lot about how to emulate a CPU. So far I have written down all the opcodes, registers etc. and compiled a Hello World written in C on an ARM machine to get the assembler code (to test my emulator).
Assembler
.arch armv8-a
.file "hello.c"
.section .text.startup, "ax", #progbits
.align 2
.p2align 3,,7
.global main
.type main, %function
main:
stp x29, x30, [sp, -16]!
adrp x1, .LC0
add x1, x1, :lo12:.LC0
mov w0, 1
add x29, sp, 0
bl __printf_chk
mov w0, 0
ldp x29, x30, [sp], 16
ret
.size main, .-main
.section .rodata.str1.8, "aMS", #progbits, 1
.align 3
.LC0:
.string "Hello World"
.ident "GCC: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10)"
.section .note.GNU-stack, "", #progbits
I have to load the code from ROM into my emulated RAM, from which the code will be executed. But how do I load it into RAM? I don't understand how to split the opcodes and the registers.
Summary
I want to emulate an ARM CPU on an Intel CPU with Windows. To test my emulator I wrote a Hello World in C and compiled it to get the assembler code. The assembler code was generated on an ARM machine with Ubuntu 16.04. My question is how to fetch the assembler code with my emulator.

Assembler not finding second operand in MOV

This is the file I am trying to assemble (filename: asmtut3.s):
.global _start
_start:
MOV R0, #20
MOV R7, #1
SWI 0
When I try to assemble it using:
as -o asmtut3.o asmtut3.s
I get the error:
asmtut3.s: Assembler messages:
asmtut3.s:3: Error: expecting operand after ','; got nothing
asmtut3.s:4: Error: expecting operand after ','; got nothing
asmtut3.s:5: Error: no such instruction: `swi 0'
I am running fedora 25, if that helps?

Use arm-none-eabi-as instead of as; the plain as is most likely your host (x86) assembler, which does not know ARM instructions:
arm-none-eabi-as -o asmtut3.o asmtut3.s

_start definition in U-boot source

I am trying to understand U-Boot (v2014.07).
In the start.S file (at arch/arm/cpu/armv7/) it loads the vector base address using the following instructions:
ldr r0, =_start
mcr p15, 0, r0, c12, c0, 0 @Set VBAR
Can you please help me understand where "_start" is defined? I checked in start.S and lowlevel_init.S, but I couldn't find it.

Can you please guide to understand where "_start" is defined
For the ARM architecture, _start is defined as a global in arch/arm/lib/vectors.S
When disassembling the start.o file, the "ldr r0, =_start" instruction is shown as "ldr r0, [pc, #104] ; 9c".
That should correspond to the first entry in the 32-byte ARM exception vector table, i.e.
ldr pc, _reset