How to use the ARM XML Machine Readable Architecture Specification to assemble and disassemble?

ARM has released a machine readable architecture specification as described at: https://alastairreid.github.io/ARM-v8a-xml-release/
Is there already a way to use that specification to assemble and disassemble ARM code, and if yes, how to use it with a minimal example?
There are some hints at: https://alastairreid.github.io/bidirectional-assemblers/ but I could not find a working implementation.
I know I can assemble and disassemble with existing GNU tools, but using the spec would always be ahead of the GNU tools as changes are made to the spec, and would be more likely to be correct.

Here is a Haskell implementation: https://github.com/nspin/hs-arm
I haven't tested it, but the README claims this:
import Harm
import Harm.Extra
import Control.Monad

main :: IO ()
main = do
    (start, words) <- elfText "../test/nix-results/test.busybox/busybox"
    forM_ (zip [start, start + 4..] words) $ \(offset, word) ->
        putStrLn $ hex offset ++ " " ++ hex word ++ " " ++
            case decode word of
                Nothing -> ".inst " ++ hex word
                Just insn -> padRight 30 (showAsmCol 7 insn) ++ encodingId insn
Produces something like:
0000000000400200 d11843ff sub sp, sp, #0x610 SUB_64_addsub_imm
0000000000400204 7100041f subs wzr, w0, #0x001 SUBS_32S_addsub_imm
0000000000400208 1a9fd7e0 csinc w0, wzr, wzr, le CSINC_32_condsel
000000000040020c 6a00003f ands wzr, w1, w0 ANDS_32_log_shift
0000000000400210 a9bd7bfd stp r29, r30, [sp, #-48]! STP_64_ldstpair_pre
0000000000400214 910003fd add x29, sp, #0x000 ADD_64_addsub_imm
0000000000400218 a90153f3 stp r19, r20, [sp, #16] STP_64_ldstpair_off
000000000040021c d0000c73 adrp r19, 0x00018e ADRP_only_pcreladdr
0000000000400220 91358263 add x3, x19, #0xd60 ADD_64_addsub_imm
0000000000400224 f9400064 ldr r4, [x3] LDR_64_ldst_pos
...

Related

How to record trace of assembly instructions and their corresponding timestamps of a C program on macOS?

I have the following C program:
#include <stdio.h>

int main() {
    float number1, number2, sum = 0.;
    number1 = .5;
    number2 = .3;
    while (sum > -10000000.)
        sum -= number1 + number2;
    printf("%f", sum);
    return 0;
}
Its corresponding assembly is as follows:
_main: ; @main
.cfi_startproc
; %bb.0:
sub sp, sp, #16 ; =16
.cfi_def_cfa_offset 16
str wzr, [sp, #12]
str wzr, [sp]
mov w8, #1056964608
str w8, [sp, #8]
mov w8, #39322
movk w8, #16025, lsl #16
str w8, [sp, #4]
LBB0_1: ; =>This Inner Loop Header: Depth=1
ldr s0, [sp]
fcvt d0, s0
adrp x8, lCPI0_0@PAGE
ldr d1, [x8, lCPI0_0@PAGEOFF]
fcmp d0, d1
b.le LBB0_3
; %bb.2: ; in Loop: Header=BB0_1 Depth=1
ldr s0, [sp, #8]
ldr s1, [sp, #4]
fadd s1, s0, s1
ldr s0, [sp]
fsub s0, s0, s1
str s0, [sp]
b LBB0_1
LBB0_3:
mov w0, #0
add sp, sp, #16 ; =16
ret
.cfi_endproc
; -- End function
.subsections_via_symbols
I want to analyse the latency of each instruction, so I'm looking for ways to obtain a program counter trace.
Desired output is as follows:
0000000000 _main: ; @main
0000000001 .cfi_startproc
0000000002; %bb.0:
0000000003 sub sp, sp, #16 ; =16
0000000004 .cfi_def_cfa_offset 16
0000000005 str wzr, [sp, #12]
0000000006 str wzr, [sp]
0000000007 mov w8, #1056964608
0000000008 str w8, [sp, #8]
0000000009 mov w8, #39322
0000000010 movk w8, #16025, lsl #16
0000000011 str w8, [sp, #4]
...
where the first column is the timestamp in picoseconds, nanoseconds, or microseconds.
Target system is macOS, compiler is llvm, debugger is lldb.
There is no way to measure instruction timing precisely at a granularity of a few cycles (at least not on this target architecture). You therefore cannot measure the latency of one specific instruction unless it is a very slow one. The reason is that the best instructions available for reading the time are themselves fairly slow, and the processor is pipelined, can execute multiple instructions per cycle, and does so out of order. This is especially true for the M1 processor you appear to be running on. On ARM, the way to measure time seems to be to read PMCCNTR, based on this post. Even with such an instruction you still need to account for superscalar out-of-order execution. The delay introduced by such an instruction depends on the target architecture, and AFAIK there is no official public information about this for the M1 (in fact, documentation on how the M1 executes instructions is pretty scarce so far).
An alternative is to simulate the execution of the code with LLVM-MCA, which statically analyses the program in order to simulate how its instructions are scheduled on the target architecture. Static analysis has a big downside: the actual runtime behaviour of loops and conditional jumps is not considered.
Note that profiling non-optimized code is generally a huge waste of time, as it does not reflect the actual execution of the release version (which should be optimized). Once optimized, the code is likely bound by the dependency chain on sum. This is especially true on the M1 processor, which can execute many instructions in parallel on the same (big/performance) core.
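Since a single instruction cannot be timed reliably, one coarse workaround is to time the whole loop and divide by the number of iterations. The following is only a minimal sketch of that idea, assuming the standard C clock() function (which reports CPU time, not wall-clock time); the name time_loop is made up for illustration. The result is an average per iteration that includes loop overhead, not a per-instruction latency.

```c
#include <time.h>

/* Hypothetical helper: runs the loop from the question, stores the CPU time
   it took (in seconds) in *elapsed_s, and returns the iteration count. */
long time_loop(double *elapsed_s)
{
    volatile float sum = 0.0f;   /* volatile keeps the loop from being optimized away */
    const float number1 = 0.5f, number2 = 0.3f;
    long iterations = 0;

    clock_t start = clock();
    while (sum > -10000000.0f) {
        sum -= number1 + number2;
        iterations++;
    }
    clock_t stop = clock();

    *elapsed_s = (double)(stop - start) / CLOCKS_PER_SEC;
    return iterations;
}
```

Dividing the elapsed time by the returned iteration count gives an average cost per iteration; because every iteration depends on the previous value of sum, that average mostly reflects the dependency chain rather than any one instruction.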

gcc arm optimizes away parameters before System Call

I'm trying to implement some "OSEK-Services" on an ARM7TDMI-S using GCC for ARM. Unfortunately, turning up the optimization level results in "wrong" code generation. The main thing I don't understand is that the compiler seems to ignore the procedure call standard, e.g. passing parameters to a function by moving them into registers r0-r3. I understand that function calls can be inlined, but the parameters still need to be in the registers to perform the system call.
Consider the following code to demonstrate my problem:
unsigned SysCall(unsigned param)
{
    volatile unsigned ret_val;
    __asm __volatile
    (
        "swi 0 \n\t" /* perform SystemCall */
        "mov %[v], r0 \n\t" /* move the result into ret_val */
        : [v]"=r"(ret_val)
        :: "r0"
    );
    return ret_val; /* return the result */
}

int main()
{
    unsigned retCode;
    retCode = SysCall(5); // expect retCode to be 6 when returning back to usermode
}
I wrote the Top-Level software interrupt handler in assembly as follows:
    .type SWIHandler, %function
    .global SWIHandler
SWIHandler:
    stmfd sp!, {r0-r2, lr}      @ save regs
    ldr r0, [lr, #-4]           @ load sysCall instruction and extract sysCall number
    bic r0, #0xff000000
    ldr r3, =DispatchTable      @ load dispatchTable
    ldr r3, [r3, r0, LSL #2]    @ load sysCall address into r3
    ldmia sp, {r0-r2}           @ load parameters into r0-r2
    mov lr, pc
    bx r3
    stmia sp, {r0-r2}           @ store the result back on the stack
    ldr lr, [sp, #12]           @ restore return address
    ldmfd sp!, {r0-r2, lr}      @ load result into register
    movs pc, lr                 @ back to next instruction after swi 0
The dispatch table looks like this:
DispatchTable:
.word activateTaskService
.word getTaskStateService
The SystemCall function looks like this:
unsigned activateTaskService(unsigned tID)
{
    return tID + 1; /* only for demonstration */
}
Running without optimization, everything works fine and the parameters are in the registers as expected.
See the following code compiled with -O0:
00000424 <main>:
424: e92d4800 push {fp, lr}
428: e28db004 add fp, sp, #4
42c: e24dd008 sub sp, sp, #8
430: e3a00005 mov r0, #5 #move param into r0
434: ebffffe1 bl 3c0 <SysCall>
000003c0 <SysCall>:
3c0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
3c4: e28db000 add fp, sp, #0
3c8: e24dd014 sub sp, sp, #20
3cc: e50b0010 str r0, [fp, #-16]
3d0: ef000000 svc 0x00000000
3d4: e1a02000 mov r2, r0
3d8: e50b2008 str r2, [fp, #-8]
3dc: e51b3008 ldr r3, [fp, #-8]
3e0: e1a00003 mov r0, r3
3e4: e24bd000 sub sp, fp, #0
3e8: e49db004 pop {fp} ; (ldr fp, [sp], #4)
3ec: e12fff1e bx lr
Compiling the same code with -O3 results in the following assembly code:
00000778 <main>:
778: e24dd008 sub sp, sp, #8
77c: ef000000 svc 0x00000000 #Inline SystemCall without passing params into r0
780: e1a02000 mov r2, r0
784: e3a00000 mov r0, #0
788: e58d2004 str r2, [sp, #4]
78c: e59d3004 ldr r3, [sp, #4]
790: e28dd008 add sp, sp, #8
794: e12fff1e bx lr
Notice how the system call gets inlined without assigning the value 5 to r0.
My first approach is to move those values manually into the registers by adapting the function SysCall from above as follows:
unsigned SysCall(volatile unsigned p1)
{
    volatile unsigned ret_val;
    __asm __volatile
    (
        "mov r0, %[p1] \n\t"
        "swi 0 \n\t"
        "mov %[v], r0 \n\t"
        : [v]"=r"(ret_val)
        : [p1]"r"(p1)
        : "r0"
    );
    return ret_val;
}
It seems to work in this minimal example, but I'm not sure whether this is the best practice. Why does the compiler think it can omit the parameters when inlining the function? Does anybody have suggestions on whether this approach is okay or what should be done differently?
Thank you in advance
A function call in C source code does not instruct the compiler to call the function according to the ABI. It instructs the compiler to call the function according to the model in the C standard, which means the compiler must pass the arguments to the function in a way of its choosing and execute the function in a way that has the same observable effects as defined in the C standard.
Those observable effects do not include setting any processor registers. When a C compiler inlines a function, it is not required to set any particular processor registers. If it calls a function using an ABI for external calls, then it would have to set registers. Inline calls do not need to obey the ABI.
So merely putting your system request inside a function built of C source code does not guarantee that any registers will be set.
For ARM, what you should do is define register variables assigned to the required register(s) and use those as input and output to the assembly instructions:
unsigned SysCall(unsigned param)
{
    register unsigned Parameter __asm__("r0") = param;
    register unsigned Result __asm__("r0");
    __asm__ volatile
    (
        "swi 0"
        : "=r" (Result)
        : "r" (Parameter)
        : // "memory" // if any inputs are pointers
    );
    return Result;
}
(This is a major kludge by GCC; it is ugly, and the documentation is poor. But see also https://stackoverflow.com/tags/inline-assembly/info for some links. GCC for some ISAs has convenient specific-register constraints you can use instead of r, but not for ARM.) The register variables do not need to be volatile; the compiler knows they will be used as input and output for the assembly instructions.
The asm statement itself should be volatile if it has side effects other than producing a return value. (e.g. getpid() doesn't need to be volatile.)
A non-volatile asm statement with outputs can be optimized away if the output is unused, or hoisted out of loops if it's used with the same input (like a pure function call). This is almost never what you want for a system call.
You also need a "memory" clobber if any of the inputs are pointers to memory that the kernel will read or modify. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more details (and a way to use a dummy memory input or output to avoid a "memory" clobber.)
A "memory" clobber on mmap/munmap or other system calls that affect what memory means would also be wise; you don't want the compiler to decide to do a store after munmap instead of before.

Conversion of ARM code into C

Here is my assembly code for A9,
ldr x1, = 0x400020 // Const value may be address also
ldr w0, = 0x200018 // Const value may be address also
str w0, [x1]
Is the following the expected output?
*((u32 *)0x400020) = 0x200018;
When I cross-checked this with a compiler, it gave a different result: mov and movs instead of ldr. How to create ldr in C?
When I cross-checked this with a compiler, it gave a different result: mov and movs
It sounds to me like you compiled the C code with a compiler targeting AArch32, but the assembly code you've shown looks like it was written for AArch64.
Here's what I get when I compile with ARM64 GCC 5.4 and optimization level O3 (comments added by me):
mov x0, 32 # x0 = 0x20
mov w1, 24 # w1 = 0x18
movk x0, 0x40, lsl 16 # x0[31:16] = 0x40
movk w1, 0x20, lsl 16 # w1[31:16] = 0x20
str w1, [x0]
How to create ldr in C?
I can't see any good reason why you'd want the compiler to generate an LDR in this case.
LDR reg,=value is a pseudo-instruction that allows you to load immediates that cannot be encoded directly in the instruction word. The assembler achieves this by placing the value (e.g. 0x200018) in a literal pool, and then replacing ldr w0, =0x200018 with a PC-relative load from that literal pool (i.e. something like ldr w0,[pc,#offset_to_value]). Accessing memory is slow, so the compiler generated another sequence of instructions for you that achieves the same thing in a more efficient manner.
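To see where the immediates in the compiler output above come from, here is a rough sketch that splits a 32-bit constant into the two 16-bit halves that a mov/movk pair would carry. The helper names imm16_chunk and emit_mov32 are made up for illustration, and the scheme is deliberately naive; a real compiler may pick other encodings (for example a single shifted or inverted move), so this is not how GCC actually selects instructions.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: the 16-bit field that "movk reg, imm, lsl shift" would write. */
uint16_t imm16_chunk(uint64_t value, unsigned shift)
{
    return (uint16_t)((value >> shift) & 0xFFFFu);
}

/* Prints a naive mov/movk sequence for a 32-bit constant and returns
   the number of instructions printed (1 or 2). */
int emit_mov32(const char *reg, uint32_t value)
{
    uint16_t lo = imm16_chunk(value, 0);
    uint16_t hi = imm16_chunk(value, 16);

    /* mov sets the low 16 bits and clears the rest of the register */
    printf("mov %s, %u\n", reg, (unsigned)lo);
    if (hi == 0)
        return 1;

    /* movk keeps the low half and overwrites bits 31:16 */
    printf("movk %s, 0x%x, lsl 16\n", reg, (unsigned)hi);
    return 2;
}
```

For 0x400020 the halves are 32 and 0x40, and for 0x200018 they are 24 and 0x20, which matches the immediates in the listing above.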
Pseudo-instructions are mainly a convenience for humans writing assembly code, making the code easier for them or their colleagues to read/write/maintain. Unlike a human being, a compiler doesn't get fatigued by repeating the same task over and over, and therefore doesn't have as much need for conveniences like that.
TL;DR: The compiler will generate what it thinks is the best (according to the current optimization level) instruction sequence. Also, that particular form of LDR is a pseudo-instruction, so you might not be able to get a compiler to generate it even if you disable all optimizations.

How do you define a macro in ARMv8?

I'm new to assembly language coding and I have not been able to find any answers to this question on the web. I'm not sure if I'm even asking the right question, but I thought it couldn't hurt. Any help would be greatly appreciated.
The examples below assume the GNU assembler (GNU as).
Macros are neat and I use them a lot; however, they can be a pain to debug, so be careful. If you are new to assembly, I would suggest learning ARMv7 first, since there are many more books, tutorials, etc. for it than for ARMv8.
// push2
.macro push2, xreg1, xreg2
.push2\@:
    stp \xreg1, \xreg2, [sp, #-16]!
.endm

// pop2
.macro pop2, xreg1, xreg2
.pop2\@:
    ldp \xreg1, \xreg2, [sp], #16
.endm

// exit
.macro _exit
.exit\@:
    mov x8, #93 // exit see /usr/include/asm-generic/unistd.h
    svc 0
.endm

.macro gCode num // Gray code... https://en.wikipedia.org/wiki/Gray_code
.gCode\@:
    mov x0, \num
    eor x0, x0, x0, lsr 1 // G Code == B XOR (B >> 1 unsigned)
.endm
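When debugging a macro like gCode, it can help to have a reference in C to compare against. This is a small sketch of the same B XOR (B >> 1) computation; the function name gray_code is just an illustrative choice.

```c
#include <stdint.h>

/* Binary-to-Gray conversion: the same operation as
   "eor x0, x0, x0, lsr 1" in the gCode macro. */
uint64_t gray_code(uint64_t b)
{
    return b ^ (b >> 1);   /* unsigned type, so >> is a logical shift */
}
```

For example, gray_code(2) is 3 and gray_code(3) is 2, so consecutive inputs differ in exactly one output bit.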

Recursion in ARM Assembly to Print 100

I'm very confused about how to recursively implement a method that prints the numbers up to 100 (1..2..3..100) in ARM assembly. I have the C code to do this and the C is very simple, but the assembly is a lot more involved and I have no clue how to do it.
Please help?
Thanks!
print100_recursive_ARM:
push {r4-r11, ip, lr}
CMP r0, #0
BEQ print_zero
SUB r0, r0, #1
BL print100_recursive_ARM
pop {r4-r11, ip, lr}
B print_num
print_num:
print_zero:
constant: .ascii "%d "
Print_ARM:
MOV r1, r0
LDR r0, =constant
BL printf
end:
pop {r4-r11, ip, lr}
BX lr
And this doesn't work.
Dirty trick: write it in C, compile with e.g. gcc -S source.c, and analyze how the compiler did it (the result is in source.s). Calling and returning and handling local variables are typically the complex parts. This way you get working assembler source to study or modify.
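As a starting point for that approach, here is one possible C version of the recursion. It is only a sketch, since the asker's own C code is not shown; the function name print_up_to and the returned count are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: prints 1..n to `out`, recursing first so that the
   smallest number is printed first. Returns how many numbers were printed. */
int print_up_to(FILE *out, int n)
{
    if (n <= 0)               /* base case: nothing left to print */
        return 0;

    int printed = print_up_to(out, n - 1);   /* print 1..n-1 first */
    fprintf(out, "%d ", n);                  /* then print n itself */
    return printed + 1;
}
```

Calling print_up_to(stdout, 100) prints 1 through 100 separated by spaces. In the compiler's assembly output, the part to look at is how the argument and the return address are saved across the recursive call and restored before the print.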

Resources