arm Inputting an integer using system interrupt (swi 0)

Does anyone know the code for inputting an integer in ARM? This is what I have so far:
.text
.global main
main:
mov r0, #0
ldr r1, =inputint
mov r2, #4
ldr R7, #3
swi 0
.data
inputint: .asciz "%d"
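A note on what the syscall above does, assuming 32-bit ARM Linux (EABI), where r7 = 3 selects read: read copies raw bytes from the file descriptor into a buffer and knows nothing about a "%d" format, so the text still has to be converted to a number by hand (and ldr R7, #3 would most likely need to be mov r7, #3 to assemble). A minimal C sketch of the same idea follows; parse_decimal and read_int are hypothetical names, not part of any library.

```c
#include <stddef.h>
#include <unistd.h>

/* Convert the decimal text in buf[0..len) to an int, stopping at the
   first non-digit (for example the trailing newline). No overflow check. */
int parse_decimal(const char *buf, size_t len)
{
    size_t i = 0;
    int sign = 1;
    int value = 0;

    if (i < len && buf[i] == '-') {
        sign = -1;
        i++;
    }
    while (i < len && buf[i] >= '0' && buf[i] <= '9') {
        value = value * 10 + (buf[i] - '0');
        i++;
    }
    return sign * value;
}

/* Read up to 15 raw bytes from fd (what the read syscall does) and
   parse them; returns 0 if nothing could be read. */
int read_int(int fd)
{
    char buf[16];
    ssize_t n = read(fd, buf, sizeof buf - 1);

    if (n <= 0)
        return 0;
    return parse_decimal(buf, (size_t)n);
}
```

In assembly the buffer would be a writable .data or .bss area of a few bytes rather than the "%d" string, and the conversion loop would follow the swi.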

Related

ARM assembly routine, test, that accepts two arguments, r0 and r1, calls bit_pos for each argument and then adds the results

I've been attempting to write some assembly code that should be equivalent to this C code,
int test( unsigned int v1, unsigned int v2 )
{
int res1, res2;
res1 = bit_pos( v1 );
res2 = bit_pos( v2 );
return res1 + res2;
}
Test should accept two arguments, r0 and r1, call bit_pos for each argument, and then add the results.
My current progress is the following:
.arch armv4
.syntax unified
.arm
.text
.align 2
.type bit_pos, %function
.global bit_pos
bit_pos:
mov r1,r0
mov r0, #1
top:
cmp r1,#0
beq done
add r0,r0,#1
lsr r1, #1
b top
done:
mov pc, lr
.align 2
.type test, %function
.global test
test:
push {r0, r1, r2, lr}
mov r0, #0x80000000
bl bit_pos
mov r1, r0
mov r0, #0x00000001
bl bit_pos
mov r2, r0
pop {r0, r1, r2, lr}
mov pc, lr
I have made multiple attempts at the test function, but the results always fail. The issue is in the test function; bit_pos itself is performing properly.
I need to pass the following test cases:
checking 5 20 res=6
checking 1 0 res=-1
checking 175 100000 res=21
but currently I'm failing with
checking 5 20 res=5
got 5, expected 6
I'm really horrible at assembly, but I promise I've given this my best shot. It's been 5 hours and I'm tired.
My current attempt:
test:
push {r4, lr}
mov r4, r0
bl bit_pos
bl bit_pos
add r0, r4
pop {r4, lr}
mov pc, lr
checking 5 20 res=6
checking 1 0 res=0
got 0, expected -1
Test failed

ARM assembly output isn't functioning correctly

This is supposed to output the contents of each line in ARM assembly. However, line 18, add r4, r5, r4, lsl #1, isn't being output correctly, and I am not sure why.
.data
str1: .asciz "%d and %d are the results \n"
n: .word 1
.text
.global main
main: stmfd sp!, {lr}
ldr r4,=n
ldr r4, [r4]
add r4,r4, #1
mov r1, r4
ldr r0, =str1
bl printf
mov r5, r4
mov r1, r4
mov r2, r5
ldr r0, =str1
bl printf
add r4, r5, r4, lsl #1
mov r1, r4
ldr r0, = str1
bl printf
ldmfd sp!, {lr}
mov r0, #0
mov pc, lr
.end

LDR pseudoinstruction

When I create ARM assembly code from C code with gcc -S, I get a variant of the LDR instruction that I don't know. Specifically, I get the "ldr r3, .L5" instruction, where ".L5" is a label defined by the compiler. It is not clear to me why I don't get the pseudoinstruction "ldr r3, =.L5", which should be the only way to load an arbitrary number into a register.
More in details:
I start from this C code (file name: sum_squares_C.c):
int n = 1;
int sum;
int main(){
sum = 0;
for(int i=1; i<=n; i++){
sum = sum + i*i;
}
}
Then, on a Raspberry Pi, I compile with "gcc -O0 -S sum_squares_C.c", with compiler version gcc (Raspbian 8.3.0-6+rpi1) 8.3.0.
The output is this ARM code (the instruction "ldr r3, .L5" is in the 7th line after label "main"):
.arch armv6
.eabi_attribute 28, 1
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 6
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "sum_squares_C.c"
.text
.global n
.data
.align 2
.type n, %object
.size n, 4
n:
.word 1
.comm sum,4,4
.text
.align 2
.global main
.arch armv6
.syntax unified
.arm
.fpu vfp
.type main, %function
main:
# args = 0, pretend = 0, frame = 8
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
ldr r3, .L5
mov r2, #0
str r2, [r3]
mov r3, #1
str r3, [fp, #-8]
b .L2
.L3:
ldr r3, [fp, #-8]
ldr r2, [fp, #-8]
mul r2, r2, r3
ldr r3, .L5
ldr r3, [r3]
add r3, r2, r3
ldr r2, .L5
str r3, [r2]
ldr r3, [fp, #-8]
add r3, r3, #1
str r3, [fp, #-8]
.L2:
ldr r3, .L5+4
ldr r3, [r3]
ldr r2, [fp, #-8]
cmp r2, r3
ble .L3
mov r3, #0
mov r0, r3
add sp, fp, #0
# sp needed
ldr fp, [sp], #4
bx lr
.L6:
.align 2
.L5:
.word sum
.word n
.size main, .-main
.ident "GCC: (Raspbian 8.3.0-6+rpi1) 8.3.0"
.section .note.GNU-stack,"",%progbits
It seems to me that gcc uses the instruction "ldr r3, .L5" as equivalent to "ldr r3, =.L5". Is it correct? Where can I find the definition of this instruction syntax? Is it possible to force gcc to not use this instruction, but use "ldr r3, =.L5" (I need this for teaching reasons)?
Thanks!
Francesco

ldr r3, .L5 loads a word from the address .L5 into r3. At the label .L5 there is the address of the variable sum. So this loads the address of sum into r3.
ldr r3, =.L5 loads the address of .L5 into r3. Then the program would need to dereference it again in order to get the address of sum. There is no reason to do this.
When you use ldr r3, =.L5 the assembler stores the address of .L5 somewhere, and then loads from that address. So this:
ldr r3, =.L5
...
.L5:
.word sum
is the same as this:
ldr r3, .address_of_L5
...
.L5:
.word sum
...
.address_of_L5:
.word .L5
As you can see, the compiler has already done this for sum. Instead of writing this assembly:
ldr r3, =sum
the compiler has written:
ldr r3, .L5
...
.L5:
.word sum
which is exactly what the assembler would have done anyway. I don't know why the compiler wants to do this instead of the assembler.
You wrote: "It is not clear to me why I don't get the pseudoinstruction 'ldr r3, =.L5', which should be the only way to load an arbitrary number in a register."
Notice this is not the only way to load an arbitrary number into a register. It's not even a real way to load an arbitrary number into a register. It's a pseudoinstruction (as you know): it's not something the CPU can actually do, it's something that the assembler can "compile" for your convenience.
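If it helps, the difference between the two forms can be pictured in C. This is only a rough analogy, not something gcc emits: literal_pool stands in for the words at .L5, and the two function names are made up for illustration.

```c
int sum;      /* the globals from the question */
int n = 1;

/* Stand-in for the words stored at .L5: each entry holds an address. */
int *const literal_pool[2] = { &sum, &n };

/* ldr r3, .L5   loads the word stored at .L5, i.e. the address of sum. */
int *load_from_label(void)
{
    return literal_pool[0];
}

/* ldr r3, =.L5  loads the address of .L5 itself, so one more
   dereference is needed before the address of sum is in hand. */
int *const *load_address_of_label(void)
{
    return &literal_pool[0];
}
```

With the first form a single str r2, [r3] reaches sum. With the second form the program would first need an extra ldr r3, [r3].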

To save typing, and to assume some risk, a person might use:
ldr r3,=sum
ldr r3,[r3]
As pointed out in the other answer, the assembler will create in machine code the equivalent of what the human could have typed without the =address trick:
ldr r3,address_of_sum (without the =)
ldr r3,[r3]
...
address_of_sum: .word sum
And that first ldr (not a pseudoinstruction, as it translates one to one into a known instruction) is a pc-relative load (assuming it can reach).
Both of these, though, are assembler specific, as assembly language is defined by the assembler, not the target.
The =address shortcut is not supported by all ARM assemblers and should be used with care; for certain values it does not turn into a word in the pool with a pc-relative load.
For questions like this, first examine the disassembly; most of the time that will answer your question. Even better, examine the disassembly first and only then the assembly in question. Compiler-generated assembly is not as easy to read and follow as a disassembly, especially once linked. It is also easier to learn from optimized code than unoptimized, as so much of the unoptimized code is this stack (or in this case global) variable traffic.
ldr r3,=0x1000
ldr r3,=0x1234
b .
00000000 <.text>:
0: e3a03a01 mov r3, #4096 ; 0x1000
4: e51f3000 ldr r3, [pc, #-0] ; c <.text+0xc>
8: eafffffe b 8 <.text+0x8>
c: 00001234 andeq r1, r0, r4, lsr r2
Where it can, the assembler generates a mov; where it can't, it allocates a word from the pool, places the value there, and does a pc-relative load. When reading the output this way you need to see past the andeq disassembly: on that line we are looking at the data value 0x00001234, which the disassembler has decoded as if it were an instruction.
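Why 0x1000 gets a mov and 0x1234 does not comes down to the classic ARM-state immediate encoding: an 8-bit value rotated right by an even amount. Here is a small C sketch of that test. The function names are made up, and it deliberately ignores other tricks an assembler may use (mvn of the inverted value, or movw on newer architectures).

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits (0 <= r < 32). */
static uint32_t rotl32(uint32_t v, unsigned r)
{
    return r ? (v << r) | (v >> (32u - r)) : v;
}

/* True if v is an 8-bit value rotated right by an even amount, which
   is what a classic ARM-state data-processing immediate can hold. */
bool fits_arm_immediate(uint32_t v)
{
    for (unsigned r = 0; r < 32; r += 2) {
        /* Undo a rotate-right by r and see if 8 bits are enough. */
        if (rotl32(v, r) <= 0xFFu)
            return true;
    }
    return false;
}
```

0x1000 passes (it is 1 rotated right by 20, matching the e3a03a01 encoding in the listing), while 0x1234 has set bits spread over 11 bit positions and fails, so it lands in the pool.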
You should not assume the =address trick will always work if you choose to try various tools. It works for GNU now as long as it can find a pool; if it can't, then you either need to do the typing yourself or add a .pool (or whatever the other directive that does the same thing is) to help the assembler find a place for this value as needed.
I would expect an assembler to always place this (=address) in the pool for an external reference, but it is technically possible for a toolchain to put a placeholder there and let the linker fill it in either with a mov or add a nearby item and place the value there like binutils does with a bl to an external reference.
gas:
ldr r3,=sum
b .
00000000 <.text>:
0: e51f3000 ldr r3, [pc, #-0] ; 8 <.text+0x8>
4: eafffffe b 4 <.text+0x4>
8: 00000000 andeq r0, r0, r0
The linker will fill in the address later as with your compiler output. Now the -0 disassembly is very interesting, almost amusing.
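The -0 makes sense once you remember that in ARM state the pc reads as the address of the current instruction plus 8. A tiny C sketch of the arithmetic the disassembler is doing for its "; c <.text+0xc>" style comments (arm_literal_address is a made-up helper name):

```c
#include <stdint.h>

/* Address accessed by an ARM-state "ldr rX, [pc, #offset]" located at
   insn_addr: pc reads as the instruction address plus 8. */
uint32_t arm_literal_address(uint32_t insn_addr, int32_t offset)
{
    return insn_addr + 8u + (uint32_t)offset;
}
```

For the two listings above, the ldr at address 4 with offset 0 reads from 0xc, and the ldr at address 0 with offset 0 reads from 8, which is exactly where the pool words sit.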

ARM Assembly Arrays

I am trying to figure out how arrays work in ARM assembly, but I am just overwhelmed. I want to initialize an array of size 20 to 0, 1, 2 and so on.
A[0] = 0
A[1] = 1
I can't even figure out how to print what I have to see if I did it correctly. This is what I have so far:
.data
.balign 4 # Memory location divisible by 4
string: .asciz "a[%d] = %d\n"
a: .skip 80 # allocates 20
.text
.global main
.extern printf
main:
push {ip, lr} # return address + dummy register
ldr r1, =a # set r1 to index point of array
mov r2, #0 # index r2 = 0
loop:
cmp r2, #20 # 20 elements?
beq end # Leave loop if 20 elements
add r3, r1, r2, LSL #2 # r3 = r1 + (r2*4)
str r2, [r3] # r3 = r2
add r2, r2, #1 # r2 = r2 + 1
b loop # branch to next loop iteration
print:
push {lr} # store return address
ldr r0, =string # format
bl printf # c printf
pop {pc} # return address
ARM confuses me enough as it is, and I don't know what I'm doing wrong. If anyone could help me better understand how this works, that would be much appreciated.

This might help others down the line who want to know how to allocate memory for an array in ARM assembly language.
Here is a simple example that adds corresponding array elements and stores the results in a third array.
.global _start
_start:
MOV R0, #5
LDR R1,=first_array # loading the address of first_array[0]
LDR R2,=second_array # loading the address of second_array[0]
LDR R7,=final_array # loading the address of final_array[0]
MOV R3,#5 # len of array
MOV R4,#0 # to store sum
check:
cmp R3,#1 # like condition in for loop for i>1
BNE loop # if R3 is not equal to 1 jump to the loop label
B _exit # else exit
loop:
LDR R5,[R1],#4 # loading the values and storing in registers and base register gets updated automatically R1 = R1 + 4
LDR R6,[R2],#4 # similarly
add R4,R5,R6
STR R4,[R7],#4 # storing the values back to the final array
SUB R3,R3,#1 # decrement value just like i-- in for loop
B check
_exit:
LDR R7,=final_array # before exiting checking the values stored
LDR R1, [R7] # R1 = 60
LDR R2, [R7,#4] # R2 = 80
LDR R3, [R7,#8] # R3 = 100
LDR R4, [R7,#12] # R4 = 120
MOV R7, #1 # terminate syscall, 1
SWI 0 # execute syscall
.data
first_array: .word 10,20,30,40
second_array: .word 50,60,70,80
final_array: .word 0,0,0,0,0
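For readers more comfortable in C, the loop above is roughly the following. This is a sketch, not a line-for-line translation: add_arrays is a made-up name, and the count is passed as the number of elements (4 here) instead of the 5-down-to-1 counter used in the assembly.

```c
#include <stddef.h>

/* Element-wise sum of two word arrays. Each pointer advances by one
   element per iteration, like the post-indexed [Rn],#4 accesses. */
void add_arrays(const int *first, const int *second, int *out, size_t count)
{
    while (count != 0) {
        int a = *first++;    /* LDR R5,[R1],#4 */
        int b = *second++;   /* LDR R6,[R2],#4 */
        *out++ = a + b;      /* ADD R4,R5,R6 then STR R4,[R7],#4 */
        count--;             /* SUB R3,R3,#1 */
    }
}
```

Called with the two arrays from the .data section, it leaves 60, 80, 100, 120 in the output array, matching the comments in _exit.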

As mentioned, your printf has problems. You can use the toolchain itself to see what the calling convention is, and then conform to that.
#include <stdio.h>
unsigned int a,b;
void notmain ( void )
{
printf("a[%d] = %d\n",a,b);
}
giving
00001008 <notmain>:
1008: e59f2010 ldr r2, [pc, #16] ; 1020 <notmain+0x18>
100c: e59f3010 ldr r3, [pc, #16] ; 1024 <notmain+0x1c>
1010: e5921000 ldr r1, [r2]
1014: e59f000c ldr r0, [pc, #12] ; 1028 <notmain+0x20>
1018: e5932000 ldr r2, [r3]
101c: eafffff8 b 1004 <printf>
1020: 0000903c andeq r9, r0, ip, lsr r0
1024: 00009038 andeq r9, r0, r8, lsr r0
1028: 0000102c andeq r1, r0, ip, lsr #32
Disassembly of section .rodata:
0000102c <.rodata>:
102c: 64255b61 strtvs r5, [r5], #-2913 ; 0xb61
1030: 203d205d eorscs r2, sp, sp, asr r0
1034: 000a6425 andeq r6, sl, r5, lsr #8
Disassembly of section .bss:
00009038 <b>:
9038: 00000000 andeq r0, r0, r0
0000903c <a>:
903c:
The calling convention is generally: first parameter in r0, second in r1, third in r2, and so on up to r3, then use the stack. There are many exceptions to this, but we can see here that the compiler, which normally works fine with a printf call, wants the address of the format string in r0, then the value of a and the value of b in r1 and r2 respectively.
Your printf has the string in r0, but a printf call with that format string needs three parameters.
The code above used a tail-call optimization and branched to printf rather than calling it and returning. The ARM convention these days prefers the stack to be aligned on 64-bit boundaries, so you can add some register you don't necessarily care to preserve to the push/pop in order to keep that alignment:
push {r3,lr}
...
pop {r3,pc}
It certainly won't hurt you to do this; it may or may not hurt to not do it, depending on what downstream code assumes.
Your setup and loop should function just fine, assuming that r1 (label a) is a word-aligned address, which it may or may not be if you mess with your string. You should put a first and then the string, or put another alignment statement before a, to ensure the array is aligned. There are instruction set features that can simplify the code, but it appears functional as is.
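As a reference point for what the assembly is aiming at, here is a short C sketch of the whole program: fill the array, then call printf once per element with all three arguments. The function names and the ELEMENTS constant are only for illustration.

```c
#include <stdio.h>

#define ELEMENTS 20

/* Store i into a[i] for every index, like the str inside the loop. */
void fill_sequence(int *a, int count)
{
    for (int i = 0; i < count; i++)
        a[i] = i;
}

/* printf gets three arguments: format (r0), index (r1), value (r2). */
void print_array(const int *a, int count)
{
    for (int i = 0; i < count; i++)
        printf("a[%d] = %d\n", i, a[i]);
}
```

Compiling something like this with your own toolchain and disassembling it, as shown above, is a quick way to see the register setup printf expects. Keep in mind that printf is free to clobber r0-r3, so in assembly the loop index and array pointer need to live somewhere that survives the call.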

ARM-C Inter-working

I am trying out a simple program for ARM-C inter-working. Here is the code:
#include<stdio.h>
#include<stdlib.h>
int Double(int a);
extern int Start(void);
int main(){
int result=0;
printf("in C main\n");
result=Start();
printf("result=%d\n",result);
return 0;
}
int Double(int a)
{
printf("inside double func_argument_value=%d\n",a);
return (a*2);
}
The assembly file is as follows:
.syntax unified
.cpu cortex-m3
.thumb
.align
.global Start
.global Double
.thumb_func
Start:
mov r10,lr
mov r0,#42
bl Double
mov lr,r10
mov r2,r0
mov pc,lr
During debugging on an LPC1769 (Embedded Artists board), I get a hard fault on the statement "result=Start();". I am trying to do an arm-C internetworking here. The lr value during the execution of the above statement (result=Start()) is 0x0000029F, where the faulting instruction is, and the pc value is 0x0000029E.
This is how I got the faulting instruction in r1
__asm("mrs r0,MSP\n"
"isb\n"
"ldr r1,[r0,#24]\n");
Can anybody please explain where I am going wrong? Any solution is appreciated.
Thank you in advance.
I am a beginner in cortex-m3 & am using the NXP LPCXpresso IDE powered by Code_Red.
Here is the disassembly of my code.
IntDefaultHandler:
00000269: push {r7}
0000026b: add r7, sp, #0
0000026d: b.n 0x26c <IntDefaultHandler+4>
0000026f: nop
00000271: mov r3, lr
00000273: mov.w r0, #42 ; 0x2a
00000277: bl 0x2c0 <Double>
0000027b: mov lr, r3
0000027d: mov r2, r0
0000027f: mov pc, lr
main:
00000281: push {r7, lr}
00000283: sub sp, #8
00000285: add r7, sp, #0
00000287: mov.w r3, #0
0000028b: str r3, [r7, #4]
0000028d: movw r3, #11212 ; 0x2bcc
00000291: movt r3, #0
00000295: mov r0, r3
00000297: bl 0xd64 <printf>
0000029b: bl 0x270 <Start>
0000029f: mov r3, r0
000002a1: str r3, [r7, #4]
000002a3: movw r3, #11224 ; 0x2bd8
000002a7: movt r3, #0
000002ab: mov r0, r3
000002ad: ldr r1, [r7, #4]
000002af: bl 0xd64 <printf>
000002b3: mov.w r3, #0
000002b7: mov r0, r3
000002b9: add.w r7, r7, #8
000002bd: mov sp, r7
000002bf: pop {r7, pc}
Double:
000002c0: push {r7, lr}
000002c2: sub sp, #8
000002c4: add r7, sp, #0
000002c6: str r0, [r7, #4]
000002c8: movw r3, #11236 ; 0x2be4
000002cc: movt r3, #0
000002d0: mov r0, r3
000002d2: ldr r1, [r7, #4]
000002d4: bl 0xd64 <printf>
000002d8: ldr r3, [r7, #4]
000002da: mov.w r3, r3, lsl #1
000002de: mov r0, r3
000002e0: add.w r7, r7, #8
000002e4: mov sp, r7
000002e6: pop {r7, pc}
As per your advice Dwelch, I have changed the r10 to r3.

I assume you mean interworking, not internetworking? The LPC1769 is a Cortex-M3, which is Thumb/Thumb-2 only, so it doesn't support ARM instructions and there is no interworking available for that platform. Nevertheless, playing with the compiler to see what goes on:
Get the compiler to do it for you first, then try it yourself in asm...
start.s
.thumb
.globl _start
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
two.c
unsigned int two ( unsigned int t )
{
return(t+5);
}
Makefile
hello.list : start.s hello.c two.c
arm-none-eabi-as -mthumb start.s -o start.o
arm-none-eabi-gcc -c -O2 hello.c -o hello.o
arm-none-eabi-gcc -c -O2 -mthumb two.c -o two.o
arm-none-eabi-ld -Ttext=0x1000 start.o hello.o two.o -o hello.elf
arm-none-eabi-objdump -D hello.elf > hello.list
clean :
rm -f *.o
rm -f *.elf
rm -f *.list
produces hello.list
Disassembly of section .text:
00001000 <_start>:
1000: 4801 ldr r0, [pc, #4] ; (1008 <hang+0x2>)
1002: 46fe mov lr, pc
1004: 4700 bx r0
00001006 <hang>:
1006: e7fe b.n 1006 <hang>
1008: 0000100c andeq r1, r0, ip
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
...
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
hello.o disassembled by itself:
00000000 <hello>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <two>
8: e8bd4008 pop {r3, lr}
c: e2800007 add r0, r0, #7
10: e12fff1e bx lr
The compiler uses bl, assuming/hoping it will be calling ARM from ARM. It wasn't, so the linker put a trampoline in there.
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
The bl to __two_from_arm is an ARM mode to ARM mode branch link. The address of the destination function (two) with the lsbit set, which tells bx to switch to Thumb mode, is loaded into the disposable register ip (r12), then the bx ip happens, switching modes. The branch link had set up the return address in lr, which was an ARM mode address no doubt (lsbit zero).
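The lsbit rule is easy to model. A small C sketch of what bx does with its operand on an interworking-capable core (struct bx_result and bx are made-up names for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

struct bx_result {
    uint32_t pc;   /* address execution continues at */
    bool thumb;    /* true if the core ends up in Thumb state */
};

/* Model of bx on an interworking-capable core: bit 0 of the operand
   selects the state and is not part of the branch address. */
struct bx_result bx(uint32_t target)
{
    struct bx_result r;

    r.thumb = (target & 1u) != 0;
    r.pc = target & ~1u;
    return r;
}
```

The pool word 0x00001021 in the trampoline therefore means "branch to 0x1020 in Thumb state", and the return address 0x1014 left in lr by the bl means "come back in ARM state".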
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
The two() function does its thing and returns. Note that you have to use bx lr, not mov pc,lr, when interworking. Basically, mov pc,lr is an okay habit only if you are running an ARMv4 without the T or an ARMv5 without the T. On anything ARMv4T or newer (ARMv5T or newer), use bx lr to return from a function unless you have a special reason not to. (Avoid using pop {pc} as well, for the same reason, unless you really need to save that instruction and are not interworking.) Now, on a Cortex-M3, which is Thumb/Thumb-2 only, you can't interwork, so you can use mov pc,lr and pop {pc}, but the code is not portable, and it is not a good habit, as it will bite you when you switch back to ARM programming.
So since hello was in ARM mode when it used bl, which is what set the link register, and the bx in __two_from_arm does not touch the link register, when two() returns with a bx lr it returns to ARM mode at the instruction after the bl __two_from_arm line in the hello() function.
Also note the extra 0x0000 after the Thumb function; this was to align the program on a word boundary so that the following ARM code was aligned...
To see how the compiler does Thumb to ARM, change two as follows:
unsigned int three ( unsigned int );
unsigned int two ( unsigned int t )
{
return(three(t)+5);
}
and put that function in hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
unsigned int three ( unsigned int t )
{
return(t+3);
}
and now we get another trampoline
00001028 <two>:
1028: b508 push {r3, lr}
102a: f000 f80b bl 1044 <__three_from_thumb>
102e: 3005 adds r0, #5
1030: bc08 pop {r3}
1032: bc02 pop {r1}
1034: 4708 bx r1
1036: 46c0 nop ; (mov r8, r8)
...
00001044 <__three_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eafffff4 b 1020 <three>
104c: 00000000 andeq r0, r0, r0
Now this is a very cool trampoline. The bl to __three_from_thumb is in Thumb mode, and the link register is set to return to the two() function, with the lsbit set no doubt to indicate a return to Thumb mode.
The trampoline starts with a bx pc. The pc is set to two instructions ahead, and the pc internally always has the lsbit clear, so a bx pc will always take you to ARM mode if not already in ARM mode, and in either mode two instructions ahead. Two instructions ahead of the bx pc is an ARM instruction that branches (not branch link!) to the three function, completing the trampoline.
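To put numbers on "two instructions ahead", here is a small C sketch of where a Thumb-state bx pc lands, assuming the bx pc itself sits on a word boundary as it does in the listing (thumb_bx_pc_target is a made-up helper name):

```c
#include <stdint.h>

/* Address reached by a Thumb-state "bx pc" located at insn_addr,
   assuming insn_addr is word aligned as in the listing. */
uint32_t thumb_bx_pc_target(uint32_t insn_addr)
{
    /* In Thumb state pc reads two 16-bit instructions ahead. */
    uint32_t pc_value = insn_addr + 4u;

    /* Bit 0 is clear, so the core continues in ARM state. */
    return pc_value & ~1u;
}
```

For the bx pc at 0x1044 this gives 0x1048, which is where the ARM-mode b to three sits; the nop at 0x1046 is just padding that never executes.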
Notice how I wrote the call to hello() in the first place
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
That actually won't work, will it? It will get you from ARM to Thumb but not from Thumb to ARM. I will leave that as an exercise for the reader.
If you change start.s to this
.thumb
.globl _start
_start:
bl hello
hang : b hang
the linker takes care of us:
00001000 <_start>:
1000: f000 f820 bl 1044 <__hello_from_thumb>
00001004 <hang>:
1004: e7fe b.n 1004 <hang>
...
00001044 <__hello_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eaffffee b 1008 <hello>
I would and do always disassemble programs like these to make sure the compiler and linker resolved these issues. Also note that for example __hello_from_thumb can be used from any thumb function, if I call hello from several places, some arm, some thumb, and hello was compiled for arm, then the arm calls would call hello directly (if they can reach it) and all the thumb calls would share the same hello_from_thumb (if they can reach it).
The compiler in these examples was assuming code that stays in the same mode (simple branch link) and the linker added the interworking code...
If you really meant inter-networking and not interworking, then please describe what that is and I will delete this answer.
EDIT:
You were using a register to preserve lr during the call to Double. That will not work; no register will work for that, you need to use memory, and the easiest is the stack. See how the compiler does it:
00001008 <hello>:
1008: e92d4008 push {r3, lr}
100c: eb000009 bl 1038 <__two_from_arm>
1010: e8bd4008 pop {r3, lr}
1014: e2800007 add r0, r0, #7
1018: e12fff1e bx lr
r3 is pushed, likely to align the stack on a 64-bit boundary (makes it faster). The thing to notice is that the link register is preserved on the stack, but the pop does not pop to pc because this is not an ARMv4 build, so a bx is needed to return from the function. Because this is ARM mode we can pop to lr and simply bx lr.
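The arithmetic behind the dummy register is simple enough to sketch in C. The helper names and the stack address used in the example are made up; the point is only that each pushed register moves sp down by 4 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stack pointer after pushing `count` 32-bit registers on a
   full-descending stack. */
uint32_t sp_after_push(uint32_t sp, unsigned count)
{
    return sp - 4u * count;
}

/* True if sp sits on a 64-bit (8-byte) boundary. */
bool is_8_byte_aligned(uint32_t sp)
{
    return (sp & 7u) == 0;
}
```

Starting from an aligned sp such as 0x20001000, push {lr} alone leaves sp at 0x20000FFC (not 8-byte aligned), while push {r3, lr} leaves it at 0x20000FF8 (aligned), which is the effect the compiler is after.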
For Thumb you can only push r0-r7 and lr directly, and pop r0-r7 and pc directly. You don't want to pop to pc, because that only works if you are staying in the same mode (Thumb or ARM). This is fine for a Cortex-M, or fine if you know what all of your callers are, but in general bad. So:
00001024 <two>:
1024: b508 push {r3, lr}
1026: f000 f811 bl 104c <__three_from_thumb>
102a: 3005 adds r0, #5
102c: bc08 pop {r3}
102e: bc02 pop {r1}
1030: 4708 bx r1
Same deal: r3 is used as a dummy register to keep the stack aligned for performance (I used the default build for gcc 4.8.0, which is likely for a platform with a 64-bit AXI bus; specifying the architecture might remove that extra register). Because we cannot pop pc, there are two pops: one to get rid of the dummy value on the stack and the other to put the return address in a register so that they can bx to it to return. I assume two pops were needed because r1 and r3 would be out of order in a single pop and r3 was chosen as the dummy (they could have chosen r2 and saved an instruction).
Your Start function does not conform to the ABI, and as a result, when you mix it in with such large libraries as a printf call, no doubt you will crash. If you didn't, it was dumb luck. Your assembly listing of main shows that neither r4 nor r10 were used, and assuming main() is not called other than from the bootstrap, that is why you got away with either r4 or r10.
If this really is an LPC1769 then this whole discussion is irrelevant, as it does not support ARM and does not support interworking (interworking = mixing of ARM mode code and Thumb mode code). Your problem was unrelated to interworking; you are not interworking (note the pop {pc} at the end of the functions). Your problem was likely related to your assembly code.
EDIT2:
Changing the makefile to specify the cortex-m
00001008 <hello>:
1008: b508 push {r3, lr}
100a: f000 f805 bl 1018 <two>
100e: 3007 adds r0, #7
1010: bd08 pop {r3, pc}
1012: 46c0 nop ; (mov r8, r8)
00001014 <three>:
1014: 3003 adds r0, #3
1016: 4770 bx lr
00001018 <two>:
1018: b508 push {r3, lr}
101a: f7ff fffb bl 1014 <three>
101e: 3005 adds r0, #5
1020: bd08 pop {r3, pc}
1022: 46c0 nop ; (mov r8, r8)
First and foremost, it is all Thumb, since there is no ARM mode on a Cortex-M. Second, the bx is not needed for function returns (because there are no ARM/Thumb mode changes), so pop {pc} will work.
It is curious that the dummy register is still used on a push. I tried an arm7tdmi/armv4t build and it still did that, so there is some other flag to use to get rid of that behavior.
If your desire was to learn how to make an assembly function that you can call from C, you should have just done that. Make a C function that somewhat resembles the framework of the function you want to create in asm:
extern unsigned int Double ( unsigned int );
unsigned int Start ( void )
{
return(Double(42));
}
assemble then disassemble
00000000 <Start>:
0: b508 push {r3, lr}
2: 202a movs r0, #42 ; 0x2a
4: f7ff fffe bl 0 <Double>
8: bd08 pop {r3, pc}
a: 46c0 nop ; (mov r8, r8)
and start with that as your assembly function.
.globl Start
.thumb_func
Start:
push {lr}
mov r0, #42
bl Double
pop {pc}
That, or read the ARM ABI for gcc and understand what registers you can and can't use without saving them on the stack, and what registers are used for passing and returning parameters.

The _start function is the entry point of a C program, and it makes the call to main(). To debug it, note that the _start function of any C program is in an assembly file. The real entry point of a program on Linux is not main() but a function called _start(). The standard libraries normally provide a version of this that runs some initialization code and then calls main().
Try compiling this with gcc -nostdlib: