I'm new to assembly language coding and I have not been able to find any answers to this question on the web. I'm not sure I'm even asking the right question, but I thought it couldn't hurt. Any help would be greatly appreciated.
The examples below assume you are using the GNU assembler (as).
Macros are neat and I use them a lot; however, they can be a pain to debug, so be careful. If you are new to assembly, I would suggest learning ARMv7 first, since there are many more books, tutorials, etc. for it than for ARMv8.
// push2: push a pair of X registers (sp stays 16-byte aligned)
.macro push2, xreg1, xreg2
// \@ expands to a per-invocation counter, so every expansion gets a unique label
.push2\@:
stp \xreg1, \xreg2, [sp, #-16]!
.endm
// pop2: pop a pair of X registers
.macro pop2, xreg1, xreg2
.pop2\@:
ldp \xreg1, \xreg2, [sp], #16
.endm
// exit
.macro _exit
.exit\@:
mov x8, #93 // exit see /usr/include/asm-generic/unistd.h
svc 0
.endm
.macro gCode num // Gray code... https://en.wikipedia.org/wiki/Gray_code
.gCode\@:
mov x0, \num
eor x0, x0, x0, lsr #1 // G Code == B XOR (B >> 1 unsigned)
.endm
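For comparison, the conversion done by the gCode macro can be sketched in C. This is only an illustration: the names to_gray and from_gray are made up here, and from_gray (the inverse conversion) has no counterpart in the macros above.

```c
#include <stdint.h>

/* Binary -> reflected Gray code: G = B XOR (B >> 1), using a logical shift. */
uint64_t to_gray(uint64_t b)
{
    return b ^ (b >> 1);
}

/* Gray -> binary: XOR g with all of its right shifts, done in six
 * steps by doubling the shift distance (1, 2, 4, ... 32). */
uint64_t from_gray(uint64_t g)
{
    uint64_t b = g;
    for (unsigned shift = 1; shift < 64; shift <<= 1)
        b ^= b >> shift;
    return b;
}
```

Consecutive inputs give outputs that differ in exactly one bit, which is the property the macro relies on.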
I just read https://www.keil.com/support/man/docs/armlink/armlink_pge1406301797482.htm but can't understand what the veneer is that the ARM linker inserts between function calls.
The "Procedure Call Standard for the ARM Architecture" document says:
5.3.1.1 Use of IP by the linker
Both the ARM- and Thumb-state BL instructions are unable to address the full 32-bit address space, so
it may be necessary for the linker to insert a veneer between the
calling routine and the called subroutine. Veneers may also be needed
to support ARM-Thumb inter-working or dynamic linking. Any veneer
inserted must preserve the contents of all registers except IP (r12)
and the condition code flags; a conforming program must assume that a
veneer that alters IP may be inserted at any branch instruction that
is exposed to a relocation that supports inter-working or long
branches. Note R_ARM_CALL, R_ARM_JUMP24, R_ARM_PC24, R_ARM_THM_CALL,
R_ARM_THM_JUMP24 and R_ARM_THM_JUMP19 are examples of the ELF
relocation types with this property. See [AAELF] for full details
Here is my guess: when function A calls function B and the two are too far apart for the bl instruction to express the distance, the linker inserts a function C between A and B, placed close to B. Function A then uses a b instruction to reach function C (copying all the registers across the call), and function C uses a bl instruction to reach B (copying all the registers too). The r12 register is used to hold the remaining bits of the long jump address. Is this what a veneer is? (I don't know why ARM explains only what a veneer provides and not what it is.)
It is just a trampoline. Interworking is the easier case to demonstrate. I am using the GNU tools here, but the implication is that Keil has a solution as well.
.globl even_more
.type even_more,%function
even_more:
bx lr
.thumb
.globl more_fun
.thumb_func
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
extern unsigned int even_more ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+even_more(a));
}
Unlinked object:
Disassembly of section .text:
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: ebfffffe bl 0 <more_fun>
c: e1a04000 mov r4, r0
10: e1a00005 mov r0, r5
14: ebfffffe bl 0 <even_more>
18: e0840000 add r0, r4, r0
1c: e8bd4070 pop {r4, r5, r6, lr}
20: e12fff1e bx lr
Linked binary (yes, completely unusable, but it demonstrates what the tool does):
Disassembly of section .text:
00001000 <fun>:
1000: e92d4070 push {r4, r5, r6, lr}
1004: e1a05000 mov r5, r0
1008: eb000008 bl 1030 <__more_fun_from_arm>
100c: e1a04000 mov r4, r0
1010: e1a00005 mov r0, r5
1014: eb000002 bl 1024 <even_more>
1018: e0840000 add r0, r4, r0
101c: e8bd4070 pop {r4, r5, r6, lr}
1020: e12fff1e bx lr
00001024 <even_more>:
1024: e12fff1e bx lr
00001028 <more_fun>:
1028: 4770 bx lr
102a: 46c0 nop ; (mov r8, r8)
102c: 0000 movs r0, r0
...
00001030 <__more_fun_from_arm>:
1030: e59fc000 ldr r12, [pc] ; 1038 <__more_fun_from_arm+0x8>
1034: e12fff1c bx r12
1038: 00001029 .word 0x00001029
103c: 00000000 .word 0x00000000
You cannot use bl to switch modes between ARM and Thumb, so the linker has added a trampoline (as I call it, or have heard it called) that you hop on and off to get to the destination. The bl still does the link part, setting lr to the return address in fun; the trampoline then does the branch part with a bx, which can change modes. Note the set low bit in the literal 0x00001029: that is what tells bx that more_fun at 0x1028 is Thumb code. You can see this done for Thumb to ARM as well as ARM to Thumb.
The even_more function is in the same mode (ARM) so no need for the trampoline/veneer.
For the distance limit of bl lemme see. Wow, that was easy, and gnu called it a veneer as well:
.globl more_fun
.type more_fun,%function
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+1);
}
MEMORY
{
bob : ORIGIN = 0x00000000, LENGTH = 0x1000
ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.some : { so.o(.text*) } > bob
.more : { more.o(.text*) } > ted
}
Disassembly of section .some:
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: eb000003 bl 18 <__more_fun_veneer>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <__more_fun_veneer>:
18: e51ff004 ldr pc, [pc, #-4] ; 1c <__more_fun_veneer+0x4>
1c: 20000000 .word 0x20000000
Disassembly of section .more:
20000000 <more_fun>:
20000000: e12fff1e bx lr
Because caller and callee stay in the same mode, the veneer did not need the bx; it simply loads the full 32-bit address of more_fun into pc.
The alternative is to replace every bl at compile time with a longer sequence, just in case a far call turns out to be needed. But since the bl offset/immediate is computed at link time anyway, the linker is the natural place to insert a trampoline/veneer, and only where one is needed to change modes or cover the distance.
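To see what the linker is working with, the bl encodings in the listings above can be decoded by hand. Here is a rough C sketch, assuming ARM-state (A32) bl, where the low 24 bits are a signed word offset applied relative to the instruction address plus 8. The function names are hypothetical, not from any toolchain.

```c
#include <stdint.h>

/* Target of an ARM-state (A32) BL: the low 24 bits hold a signed word
 * offset, applied relative to the instruction address + 8. */
uint32_t arm_bl_target(uint32_t insn_addr, uint32_t insn)
{
    int32_t imm24 = (int32_t)(insn & 0x00ffffffu);
    if (imm24 & 0x00800000)
        imm24 -= 0x01000000;            /* sign-extend 24 -> 32 bits */
    return insn_addr + 8u + (uint32_t)imm24 * 4u;
}

/* Can a BL at insn_addr reach target directly (roughly +/-32 MB)? */
int arm_bl_can_reach(uint32_t insn_addr, uint32_t target)
{
    int64_t offset = (int64_t)target - ((int64_t)insn_addr + 8);
    return (offset % 4 == 0) && offset >= -(1LL << 25) && offset < (1LL << 25);
}
```

With this, arm_bl_target(0x1008, 0xeb000008) gives 0x1030, the address of __more_fun_from_arm in the first listing, and arm_bl_can_reach(0x4, 0x20000000) is false, which is why the second example needed a veneer.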
You should be able to repeat this yourself with the Keil tools; all you need to do is either switch modes on an external function call or exceed the reach of the bl instruction.
Edit
Understand that toolchains vary, even within one toolchain family: gcc 3.x.x was the first to support Thumb, and I do not know that I saw this behavior back then. Note that the linker is part of binutils, which is a separate development from gcc. You mention the "arm linker"; ARM has its own toolchain, then they bought Keil and perhaps replaced Keil's tools with their own, or not. Then there are GNU, clang/llvm, and others. So it is not a case of the "arm linker" doing this or that; it is a case of each toolchain's linker doing this or that. First, each toolchain is free to use whatever calling convention it wants; there is no mandate to follow ARM's recommendations. Second, it can choose to implement veneers, or not, or simply give you a warning and leave you to deal with it (likely in assembly language or through function pointers).
ARM does not need to explain it; or let us say it is already clearly explained in the Architecture Reference Manual for each particular architecture (look at the bl instruction and the bx instruction, look for the word interworking, and so on). So there is no reason to explain it again. For a generic statement, where the reach of bl varies and each architecture has different interworking features, it would take a long set of paragraphs or a short chapter to explain something that is already clearly documented.
Anyone implementing a compiler and linker would be well versed in the instruction set beforehand and would understand bl, conditional branches, and the other limitations of the instruction set. Some instruction sets offer near and far jumps, and in some of those the assembly language uses the same mnemonic for both; the assembler will then often choose a far jump/call when it does not see the label in the same file, so that the objects can be linked.
In any case, before linking you have to compile and assemble, and the toolchain folks will have fully understood the rules of the architecture. ARM is not special here.
This is Raymond Chen's comment:
The veneer has to be close to A because B is too far away. A does a bl
to the veneer, and the veneer sets r12 to the final destination(B) and
does a bx r12. bx can reach the entire address space.
This answers my question well enough, but he doesn't want to write a full answer (maybe for lack of time), so I am putting it here as an answer and accepting it. If someone posts a better, more detailed answer, I'll switch to it.
ARM has released a machine readable architecture specification as described at: https://alastairreid.github.io/ARM-v8a-xml-release/
Is there already a way to use that specification to assemble and disassemble ARM code, and if yes, how to use it with a minimal example?
There are some hints at: https://alastairreid.github.io/bidirectional-assemblers/ but I could not find a working implementation.
I know I can assemble and disassemble with the existing GNU tools, but tools derived from the spec would always be ahead of the GNU tools as changes are made to the spec, and would be more likely to be correct.
Here is a Haskell implementation: https://github.com/nspin/hs-arm
I haven't tested it, but the README claims this:
import Harm
import Harm.Extra
import Control.Monad

main :: IO ()
main = do
    (start, words) <- elfText "../test/nix-results/test.busybox/busybox"
    forM_ (zip [start, start + 4..] words) $ \(offset, word) ->
        putStrLn $ hex offset ++ " " ++ hex word ++ " " ++
            case decode word of
                Nothing -> ".inst " ++ hex word
                Just insn -> padRight 30 (showAsmCol 7 insn) ++ encodingId insn
Produces something like:
0000000000400200 d11843ff sub sp, sp, #0x610 SUB_64_addsub_imm
0000000000400204 7100041f subs wzr, w0, #0x001 SUBS_32S_addsub_imm
0000000000400208 1a9fd7e0 csinc w0, wzr, wzr, le CSINC_32_condsel
000000000040020c 6a00003f ands wzr, w1, w0 ANDS_32_log_shift
0000000000400210 a9bd7bfd stp r29, r30, [sp, #-48]! STP_64_ldstpair_pre
0000000000400214 910003fd add x29, sp, #0x000 ADD_64_addsub_imm
0000000000400218 a90153f3 stp r19, r20, [sp, #16] STP_64_ldstpair_off
000000000040021c d0000c73 adrp r19, 0x00018e ADRP_only_pcreladdr
0000000000400220 91358263 add x3, x19, #0xd60 ADD_64_addsub_imm
0000000000400224 f9400064 ldr r4, [x3] LDR_64_ldst_pos
...
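The last column of that output (for example SUB_64_addsub_imm) suggests a table-driven decode: each encoding amounts to a set of fixed bits plus named fields. As a rough illustration only, here is a C sketch for that single encoding. The mask and value are written out by hand from the A64 SUB (immediate) layout rather than generated from the XML, and the struct and function names are hypothetical; nothing here is taken from hs-arm.

```c
#include <stdint.h>

/* One row of a hypothetical decode table: a word matches when
 * (word & mask) == value. Hand-written here, not generated from the spec. */
struct encoding {
    const char *id;
    uint32_t mask;
    uint32_t value;
};

/* A64 SUB (immediate), 64-bit: sf=1 op=1 S=0, then 100010 in bits 28..23. */
const struct encoding sub_64_imm = {
    "SUB_64_addsub_imm", 0xff800000u, 0xd1000000u
};

struct addsub_imm {
    uint32_t rd, rn;   /* register numbers; 31 means SP for this encoding */
    uint32_t imm;      /* immediate after applying the optional shift */
};

/* Returns 1 and fills *out if word matches enc, otherwise returns 0. */
int decode_addsub_imm(const struct encoding *enc, uint32_t word,
                      struct addsub_imm *out)
{
    if ((word & enc->mask) != enc->value)
        return 0;
    out->rd  = word & 0x1fu;
    out->rn  = (word >> 5) & 0x1fu;
    out->imm = (word >> 10) & 0xfffu;
    if ((word >> 22) & 1u)       /* sh bit: immediate is shifted left by 12 */
        out->imm <<= 12;
    return 1;
}
```

Fed the first word of the listing, 0xd11843ff, this yields rd = 31, rn = 31 and imm = 0x610, matching sub sp, sp, #0x610. A spec-driven tool would presumably generate many such rows from the XML instead of writing them by hand.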
My goal is to implement a sorting algorithm in C.
The C code has to compile to the smallest possible number of instructions when built with gcc -O0 (no optimization) on an ARM machine.
So my idea is to embed a quicksort implemented in assembly directly in the C code.
I consulted the documents listed below and tried to implement this.
However, I don't know how to pass intarray to my assembly function 'Quicksort' as a parameter.
References
1. https://en.wikibooks.org/wiki/Algorithm_Implementation/Sorting/Quicksort#ARM_Assembly
2. http://forum.falinux.com/zbxe/index.php?mid=lecture_tip&comment_srl=517498&sort_index=readed_count&order_type=asc&l=fr&page=58&document_srl=567970 (sorry for the non-English website)
I'm a newbie in assembly.
Please help me.
#include <stdio.h>
#include <stdint.h>
int Quicksort(uint32_t intarray[]);
asm(
".global Quicksort\n\
Quicksort:\n\
qsort:\n\
stmfd sp!,{r4, r6, lr} \n\
mov r6,r2 \n\
qsort_tailcall_entry:\n\
sub r7,r6,r1\n\
cmp r7,#1\n\
ldmlefd sp!,{r4,r6,pc}\n\
ldr r7,[r0,r1,asl#2]\n\
add r2,r1,#1\n\
mov r4,r6\n\
partition_loop:\n\
ldr r3,[r0, r2, asl #2]\n\
cmp r3,r7\n\
addle r2,r2, #1\n\
ble partition_test\n\
sub r4,r4, #1\n\
ldr r5,[r0, r4, asl #2]\n\
str r5,[r0, r2, asl #2]\n\
str r3,[r0, r4, asl #2]\n\
partition_test:\n\
cmp r2,r4\n\
blt partition_loop\n\
partition_finish:\n\
sub r2,r2,#1\n\
ldr r3,[r0,r2,asl #2]\n\
str r3,[r0,r1,asl #2]\n\
str r7,[r0,r2,asl #2]\n\
bl qsort\n\
mov r1,r4\n\
b qsort_tailcall_entry\n\
"
);
int main(void){
uint32_t intarray[10] = {5,2,5,1,7,5,7,2,3,8};
Quicksort(intarray);
return 0;
}
Since you mentioned that you are compiling with gcc, you could use the gcc asm extension (as the name says, it is a gcc extension and might not be compatible with other compilers). Take a look at basic asm and extended asm in the gcc documentation. Since you will probably be accessing data from your C code, I advise you to stick with the extended version, which lets you specify operands, including memory operands.
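On the parameter question itself: under the ARM procedure call standard the first four integer or pointer arguments arrive in r0 to r3. As far as I can tell from the listing in the question, the routine reads r0 as the array base, r1 as the first index and r2 as one past the last index, so a prototype with a single argument leaves r1 and r2 undefined. Below is a portable C model of that interface and of the same partition scheme. It is a sketch based on my reading of the assembly, and quicksort_model is a made-up name.

```c
#include <stdint.h>

/* Portable C model of the interface the assembly routine appears to expect:
 * a -> r0 (array base), left -> r1 (first index), right -> r2 (one past last).
 * The name quicksort_model is hypothetical. */
void quicksort_model(int32_t *a, int left, int right)
{
    while (right - left > 1) {
        int32_t pivot = a[left];
        int lo = left + 1;      /* next element to examine */
        int hi = right;         /* start of the "greater than pivot" region */
        while (lo < hi) {
            if (a[lo] <= pivot) {
                lo++;
            } else {
                hi--;
                int32_t tmp = a[lo];
                a[lo] = a[hi];
                a[hi] = tmp;
            }
        }
        lo--;                   /* final position of the pivot */
        a[left] = a[lo];
        a[lo] = pivot;
        quicksort_model(a, left, lo);   /* sort the left part recursively */
        left = hi;                      /* loop on the right part (tail call) */
    }
}
```

If that reading is right, a matching prototype for the assembly version would be something like void Quicksort(uint32_t *a, int left, int right); with the call written as Quicksort(intarray, 0, 10);. Note also that the listing writes r5 and r7 while stacking only r4, r6 and lr; the calling convention treats r4 to r11 as callee-saved, so that may need attention too.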
I'm writing a program that runs on a Cortex-M3.
At first I wrote an assembly file that executes 'svc'.
svc:
svc 0
bx lr
I decided to use gcc's inline asm, so I wrote it as follows, but the svc function was not inlined.
__attribute__((naked))
int svc(int no, ...)
{
(void)no;
asm("svc 0\n\tbx lr");
}
int f() {
return svc(0,1,2);
}
------------------ generated assembly ------------------
svc:
svc 0
bx lr
f:
mov r0, #0
mov r1, #1
mov r2, #2
b svc
I guess it's not inlined because it is naked, so I dropped the naked attribute and wrote it like this.
int svc(int __no, ...)
{
register int no asm("r0") = __no;
register int ret asm("r0");
asm("svc 0" : "=r"(ret) : "r"(no));
return ret;
}
------------------ generated assembly ------------------
svc:
stmfd sp!, {r0, r1, r2, r3}
ldr r0, [sp]
add sp, sp, #16
svc 0
bx lr
f:
mov r0, #0 // missing instructions setting r1 and r2
svc 0
bx lr
Although I don't know why gcc adds some unnecessary stack operations, the svc function itself is fine. The problem is that svc is not inlined properly: the variadic arguments were dropped.
Is there an svc primitive in gcc? If gcc does not have one, how do I write one correctly?
Have a look at the syntax that is used in core_cmFunc.h which is supplied as part of the ARM CMSIS for the Cortex-M family. Here's an example that writes a value to the Priority Mask Register:
__attribute__ ((always_inline)) static inline void __set_PRIMASK(uint32_t priMask)
{
__ASM volatile ("MSR primask, %0"::"r" (priMask));
}
However, creating a variadic function like this sounds difficult.
You can use a macro like this.
#define __svc(sNum) __asm volatile("SVC %0" ::"M" (sNum))
You can then use it just like a compiler-primitive function: __svc(2);
Since it is just a macro, it will only generate the provided instruction.
I'm very confused about how to recursively implement a routine that prints 1 to 100 (1..2..3..100) in ARM assembly. I have the C code for this and it is very simple, but the assembly is a lot longer and I have no clue how to do it.
Please help?
Thanks!
print100_recursive_ARM:
push {r4-r11, ip, lr}
CMP r0, #0
BEQ print_zero
SUB r0, r0, #1
BL print100_recursive_ARM
pop {r4-r11, ip, lr}
B print_num
print_num:
print_zero:
constant: .ascii "%d "
Print_ARM:
MOV r1, r0
LDR r0, =constant
BL printf
end:
pop {r4-r11, ip, lr}
BX lr
And this doesn't work.
Dirty trick: write it in C, compile it with e.g. gcc -S source.c, and analyze how the compiler did it (the output is in source.s). How to call and return and how to handle local variables is typically the complex part. This way you get working assembler source to study or modify.
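As a starting point for that trick, a C version of the routine might look like the sketch below. The name print_upto, the FILE parameter and the returned count are my own additions for illustration; the asker's actual C code was not posted.

```c
#include <stdio.h>

/* Prints 1..n separated by spaces: recurse down to the base case first,
 * then print on the way back up. Returns how many numbers were printed. */
int print_upto(int n, FILE *out)
{
    if (n <= 0)                 /* base case: nothing left to print */
        return 0;
    int printed = print_upto(n - 1, out);
    fprintf(out, "%d ", n);
    return printed + 1;
}
```

Calling print_upto(100, stdout) from main and compiling with gcc -S should show the pieces the hand-written attempt is struggling with: one push of the saved registers on entry, one matching pop on exit, and the value of n kept in a saved register across the recursive call.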