comprehending how "volatile" keyword and comparison work

comprehending how "volatile" keyword and comparison work - c

If a variable is not specified with the keyword volatile, the compiler likely does caching. The variable must be accessed from memory always otherwise until its transaction unit ends. The point I wonder lies in assembly part.
int main() {
/* volatile */ int lock = 999;
while (lock);
}
On x86-64-clang-3.0.0 compiler, its assembly code is following.
main: # #main
mov DWORD PTR [RSP - 4], 0
mov DWORD PTR [RSP - 8], 999
.LBB0_1: # =>This Inner Loop Header: Depth=1
cmp DWORD PTR [RSP - 8], 0
je .LBB0_3
jmp .LBB0_1
.LBB0_3:
mov EAX, DWORD PTR [RSP - 4]
ret
When volatile keyword is commented in, it turns out the following.
main: # #main
mov DWORD PTR [RSP - 4], 0
mov DWORD PTR [RSP - 8], 999
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov EAX, DWORD PTR [RSP - 8]
cmp EAX, 0
je .LBB0_3
jmp .LBB0_1
.LBB0_3:
mov EAX, DWORD PTR [RSP - 4]
ret
The points I wonder and don't understand,
cmp DWORD PTR [RSP - 8], 0 . <---
Why is the comparison done with 0 whilst DWORD PTR [RSP - 8] holds 999 within ?
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?

It looks like you forgot to enable optimization. -O0 treats all variables (except register variables) pretty similarly to volatile for consistent debugging.
With optimization enabled, compilers can hoist non-volatile loads out of loops. while(locked); will compile similarly to source like
if (locked) {
while(1){}
}
Or since locked has a compile-time-constant initializer, the whole function should compile to jmp main (an infinite loop).
See MCU programming - C++ O2 optimization breaks while loop for more details.
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
Some compilers are worse at folding loads into memory operands for other instructions when you use volatile. I think that's why you're getting a separate mov load here; it's just a missed optimization.
(Although cmp [mem], imm might be less efficient. I forget if it can macro-fuse with a JCC or something. With a RIP-relative addressing mode it couldn't micro-fuse the load, but a register base is ok.)
cmp EAX, 0 is weird, I guess clang with optimization disabled doesn't look for test eax,eax as a peephole optimization for comparing against zero.
As #user3386109 commented, locked in a boolean context is equivalent to locked != 0 in C / C++.

The compiler doesn't know about caching, it is not a caching thing, it tells the compiler that the value may change between accesses. So to functionally implement our code it needs to perform the accesses we ask for in the order we ask them. Can't optimize out.
void fun1 ( void )
{
/* volatile */ int lock = 999;
while (lock) continue;
}
void fun2 ( void )
{
volatile int lock = 999;
while (lock) continue;
}
volatile int vlock;
int ulock;
void fun3 ( void )
{
while(vlock) continue;
}
void fun4 ( void )
{
while(ulock) continue;
}
void fun5 ( void )
{
vlock=3;
vlock=4;
}
void fun6 ( void )
{
ulock=3;
ulock=4;
}
I find it easier to see in arm... doesn't really matter.
Disassembly of section .text:
00001000 <fun1>:
1000: eafffffe b 1000 <fun1>
00001004 <fun2>:
1004: e59f3018 ldr r3, [pc, #24] ; 1024 <fun2+0x20>
1008: e24dd008 sub sp, sp, #8
100c: e58d3004 str r3, [sp, #4]
1010: e59d3004 ldr r3, [sp, #4]
1014: e3530000 cmp r3, #0
1018: 1afffffc bne 1010 <fun2+0xc>
101c: e28dd008 add sp, sp, #8
1020: e12fff1e bx lr
1024: 000003e7 andeq r0, r0, r7, ror #7
00001028 <fun3>:
1028: e59f200c ldr r2, [pc, #12] ; 103c <fun3+0x14>
102c: e5923000 ldr r3, [r2]
1030: e3530000 cmp r3, #0
1034: 012fff1e bxeq lr
1038: eafffffb b 102c <fun3+0x4>
103c: 00002000
00001040 <fun4>:
1040: e59f3014 ldr r3, [pc, #20] ; 105c <fun4+0x1c>
1044: e5933000 ldr r3, [r3]
1048: e3530000 cmp r3, #0
104c: 012fff1e bxeq lr
1050: e3530000 cmp r3, #0
1054: 012fff1e bxeq lr
1058: eafffffa b 1048 <fun4+0x8>
105c: 00002004
00001060 <fun5>:
1060: e3a01003 mov r1, #3
1064: e3a02004 mov r2, #4
1068: e59f3008 ldr r3, [pc, #8] ; 1078 <fun5+0x18>
106c: e5831000 str r1, [r3]
1070: e5832000 str r2, [r3]
1074: e12fff1e bx lr
1078: 00002000
0000107c <fun6>:
107c: e3a02004 mov r2, #4
1080: e59f3004 ldr r3, [pc, #4] ; 108c <fun6+0x10>
1084: e5832000 str r2, [r3]
1088: e12fff1e bx lr
108c: 00002004
Disassembly of section .bss:
00002000 <vlock>:
2000: 00000000
00002004 <ulock>:
2004: 00000000
First one is the most telling:
00001000 <fun1>:
1000: eafffffe b 1000 <fun1>
Being a local variable that is initialized, and non volatile then the compiler can assume it won't change value between accesses so it can never change in the while loop, so this is essentially a while 1 loop. If the initial value had been zero this would be a simple return as it can never be non-zero, being non-volatile.
fun2 being a local variable a stack frame needs to be built then.
It does what one assumes the code was trying to do, wait for this shared variable, one that can change during the loop
1010: e59d3004 ldr r3, [sp, #4]
1014: e3530000 cmp r3, #0
1018: 1afffffc bne 1010 <fun2+0xc>
so it samples it and tests what it samples each time through the loop.
fun3 and fun4 same deal but more realistic, as external to the function code isnt going to change lock, being non-global doesn't make much sense for your while loop.
102c: e5923000 ldr r3, [r2]
1030: e3530000 cmp r3, #0
1034: 012fff1e bxeq lr
1038: eafffffb b 102c <fun3+0x4>
For the volatile fun3 case the variable has to be read and tested each loop
1044: e5933000 ldr r3, [r3]
1048: e3530000 cmp r3, #0
104c: 012fff1e bxeq lr
1050: e3530000 cmp r3, #0
1054: 012fff1e bxeq lr
1058: eafffffa b 1048 <fun4+0x8>
For the non-volatile being global it has to sample it once, very interesting what the compiler did here, have to think about why it would do that, but either way you can see that the "loop" retests the value read stored in a register (not cached) which will never change with a proper program. Functionally we asked it to only read the variable once by using non-volatile then it tests that value indefinitely.
fun5 and fun6 further demonstrate that volatile requires the compiler perform the accesses to the variable in its storage place before moving on to the next operation/access in the code. So when volatile we are asking the compiler to perform two assignments, two stores. When non-volatile the compiler can optimize out the first store and only do the last one as if you look at the code as a whole this function (fun6) leaves the variable set to 4, so the function leaves the variable set to 4.
The x86 solution is equally interesting repz retq is all over it (with the compiler on my computer), not hard to find out what that is all about.
Neither aarch64, x86, mips, riscv, msp430, pdp11 backends do the double check on fun3().
pdp11 is actually the easier code to read (no surprise there)
00000000 <_fun1>:
0: 01ff br 0 <_fun1>
00000002 <_fun2>:
2: 65c6 fffe add $-2, sp
6: 15ce 03e7 mov $1747, (sp)
a: 1380 mov (sp), r0
c: 02fe bne a <_fun2+0x8>
e: 65c6 0002 add $2, sp
12: 0087 rts pc
00000014 <_fun3>:
14: 1dc0 0026 mov $3e <_vlock>, r0
18: 02fd bne 14 <_fun3>
1a: 0087 rts pc
0000001c <_fun4>:
1c: 1dc0 001c mov $3c <_ulock>, r0
20: 0bc0 tst r0
22: 02fe bne 20 <_fun4+0x4>
24: 0087 rts pc
00000026 <_fun5>:
26: 15f7 0003 0012 mov $3, $3e <_vlock>
2c: 15f7 0004 000c mov $4, $3e <_vlock>
32: 0087 rts pc
00000034 <_fun6>:
34: 15f7 0004 0002 mov $4, $3c <_ulock>
3a: 0087 rts pc
(this is the not linked version)
cmp DWORD PTR [RSP - 8], 0 . <--- Why is the comparison done with 0 whilst DWORD PTR [RSP - 8] holds 999 within ?
while does a true false comparison meaning is it equal to zero or not equal to zero
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
mov -0x8(%rsp),%eax
cmp 0,%eax
cmp 0,-0x8(%rsp)
as so.s -o so.o
so.s: Assembler messages:
so.s:3: Error: too many memory references for `cmp'
compare wants a register. So it reads into a register so it can do the compare as it can't do the compare between the immediate and the memory access in one instruction. If they could have done it in one instruction they would have.

Related

ARM assembly routine, test, that accepts two arguments, r0 and r1, calls bit_pos for each argument and then adds the results

I've been attempting to write some assembly code that should be equivalent to this C code,
int test( unsigned int v1, unsigned int v2 )
{
int res1, res2;
res1 = bit_pos( v1 );
res1 = bit_pos( v2 );
return res1 + res2;
}
Test should accept two arguments, r0 and r1, calls bit_pos for each argument and then adds the results.
My current progress is the following :-
.arch armv4
.syntax unified
.arm
.text
.align 2
.type bit_pos, %function
.global bit_pos
bit_pos:
mov r1,r0
mov r0, #1
top:
cmp r1,#0
beq done
add r0,r0,#1
lsr r1, #1
b top
done:
mov pc, lr
.align 2
.type test, %function
.global test
test:
push {r0, r1, r2, lr}
mov r0, #0x80000000
bl bit_pos
mov r1, r0
mov r0, #0x00000001
bl bit_pos
mov r2, r0
pop {r0, r1, r2, lr}
mov pc, lr
I tried multiple attempts for the test function but the results always fail to pass, the issue is in the test function but the bit_pos is performing properly.
I need to pass the following test cases
checking 5 20 res=6
checking 1 0 res=-1
checking 175 100000 res=21
but currently I'm failing with
checking 5 20 res=5
got 5, expected 6
I'm really horrible at assembly but I promise I given this my best shot, It's been 5 hours and I'm tired.
My current attempt
test:
push {r4, lr}
mov r4, r0
bl bit_pos
bl bit_pos
add r0, r4
pop {r4, lr}
mov pc, lr
checking 5 20 res=6
checking 1 0 res=0
got 0, expected -1
Test failed

Example of a static vs automatic variable in assembly

In C, we can use the following two examples to show the difference between a static and non-static variable:
for (int i = 0; i < 5; ++i) {
static int n = 0;
printf("%d ", ++n); // prints 1 2 3 4 5 - the value persists
}
And:
for (int i = 0; i < 5; ++i) {
int n = 0;
printf("%d ", ++n); // prints 1 1 1 1 1 - the previous value is lost
}
Source: this answer.
What would be the most basic example in assembly to show the difference between how a static or non-static variable is created? (Or does this concept not exist in assembly?)

To implement a static object in assembly, you define it in a data section (of which there are various types, involving options for initialization and modification).
To implement an automatic object in assembly, you include space for it in the stack frame of a routine.
Examples, not necessarily syntactically correct in a particular assembly language, might be:
.data
foo: .word 34 // Static object named "foo".
.text
…
lr r3, foo // Load value of foo.
and:
.text
bar: // Start of routine named "bar".
foo = 16 // Define a symbol for convenience.
add sp, sp, -CalculatedSize // Allocate stack space for local data.
…
li r3, 34 // Load immediate value into register.
sr r3, foo(sp) // Store value into space reserved for foo on stack.
…
add sp, sp, +CalculatedSize // Automatic objects are released here.
ret
These are very simplified examples (as requested). Many modern schemes for using the hardware stack include frame pointers, which are not included above.
In the second example, CalculatedSize represents some amount that includes space for registers to be saved, space for the foo object, space for arguments for subroutine calls, and whatever other stack space is needed by the routine. The offset of 16 provided for foo is part of those calculations; the author of the routine would arrange their stack frame largely as they desire.

Just try it
void more_fun ( int );
void fun0 ( void )
{
for (int i = 0; i < 500; ++i) {
static int n = 0;
more_fun(++n);
}
}
void fun1 ( void )
{
for (int i = 0; i < 500; ++i) {
int n = 0;
more_fun( ++n);
}
}
Disassembly of section .text:
00000000 <fun0>:
0: e92d4070 push {r4, r5, r6, lr}
4: e3a04f7d mov r4, #500 ; 0x1f4
8: e59f501c ldr r5, [pc, #28] ; 2c <fun0+0x2c>
c: e5953000 ldr r3, [r5]
10: e2833001 add r3, r3, #1
14: e1a00003 mov r0, r3
18: e5853000 str r3, [r5]
1c: ebfffffe bl 0 <more_fun>
20: e2544001 subs r4, r4, #1
24: 1afffff8 bne c <fun0+0xc>
28: e8bd8070 pop {r4, r5, r6, pc}
2c: 00000000
00000030 <fun1>:
30: e92d4010 push {r4, lr}
34: e3a04f7d mov r4, #500 ; 0x1f4
38: e3a00001 mov r0, #1
3c: ebfffffe bl 0 <more_fun>
40: e2544001 subs r4, r4, #1
44: 1afffffb bne 38 <fun1+0x8>
48: e8bd8010 pop {r4, pc}
Disassembly of section .bss:
00000000 <n.4158>:
0: 00000000 andeq r0, r0, r0
I like to think of static locals as local globals. They sit in .bss or .data just like globals. But from a C perspective they can only be accessed within the function/context that they were created in.
I local variable has no need for long term storage, so it is "created" and destroyed within that fuction. If we were to not optimize you would see that
some stack space is allocated.
00000064 <fun1>:
64: e92d4800 push {fp, lr}
68: e28db004 add fp, sp, #4
6c: e24dd008 sub sp, sp, #8
70: e3a03000 mov r3, #0
74: e50b300c str r3, [fp, #-12]
78: ea000009 b a4 <fun1+0x40>
7c: e3a03000 mov r3, #0
80: e50b3008 str r3, [fp, #-8]
84: e51b3008 ldr r3, [fp, #-8]
88: e2833001 add r3, r3, #1
8c: e50b3008 str r3, [fp, #-8]
90: e51b0008 ldr r0, [fp, #-8]
94: ebfffffe bl 0 <more_fun>
98: e51b300c ldr r3, [fp, #-12]
9c: e2833001 add r3, r3, #1
a0: e50b300c str r3, [fp, #-12]
a4: e51b300c ldr r3, [fp, #-12]
a8: e3530f7d cmp r3, #500 ; 0x1f4
ac: bafffff2 blt 7c <fun1+0x18>
b0: e1a00000 nop ; (mov r0, r0)
b4: e24bd004 sub sp, fp, #4
b8: e8bd8800 pop {fp, pc}
But optimized for fun1 the local variable is kept in a register, faster than keeping on the stack, in this solution they save the upstream value held in r4 so
that r4 can be used to hold n within this function, when the function returns there is no more need for n per the rules of the language.
For the static local, per the rules of the language that value remains static
outside the function and can be accessed within. Because it is initialized to 0 it lives in .bss not .data (gcc, and many others). In the code above the linker
will fill this value
2c: 00000000
in with the address to this
00000000 <n.4158>:
0: 00000000 andeq r0, r0, r0
IMO one could argue the implementation didnt need to treat it like a volatile and
sample and save n every loop. Could have basically implemented it like the second
function, but sampled up front from memory and saved it in the end. Either way
you can see the difference in an implementation of the high level code. The non-static local only lives within the function and then its storage anc contents
are essentially gone.

When modifying a variable, the static local variable modified by static is executed only once, and the life cycle of the local variable is extended until the program is run.
If you don't add static, each loop reallocates the value.

R7 and R11 relation with Link Register in ARM architecture (thumb/arm) calling convention

I was looking at a arm assembly code generated by gcc, and I noticed that the GCC compiled a function with the following code:
0x00010504 <+0>: push {r7, lr}
0x00010506 <+2>: sub sp, #24
0x00010508 <+4>: add r7, sp, #0
0x0001050a <+6>: str r0, [r7, #4]
=> 0x0001050c <+8>: mov r3, lr
0x0001050e <+10>: mov r1, r3
0x00010510 <+12>: movw r0, #1664 ; 0x680
0x00010514 <+16>: movt r0, #1
0x00010518 <+20>: blx 0x10378 <printf#plt>
0x0001051c <+24>: add.w r3, r7, #12
0x00010520 <+28>: mov r0, r3
0x00010522 <+30>: blx 0x10384 <gets#plt>
0x00010526 <+34>: mov r3, lr
0x00010528 <+36>: mov r1, r3
0x0001052a <+38>: movw r0, #1728 ; 0x6c0
0x0001052e <+42>: movt r0, #1
0x00010532 <+46>: blx 0x10378 <printf#plt>
0x00010536 <+50>: adds r7, #24
0x00010538 <+52>: mov sp, r7
0x0001053a <+54>: pop {r7, pc}
The thing which was interesting for me was that, I see the GCC uses R7 to pop the values to PC instead of LR. I saw similar thing with R11. The compiler push the r11 and LR to the stack and then pop the R11 to the PC. should not LR act as return address instead of R7 or R11. Why does the R7 (which is a frame pointer in Thumb Mode) being used here?
If you look at apple ios calling convention it is even different. It uses other registers (e.g. r4 to r7) to PC to return the control. Should not it use LR?
Or I am missing something here?
Another question is that, it looks like that the LR, R11 or R7 values are never an immediate value to the return address. But a pointer to the stack which contain the return address. Is that right?
Another weird thing is that compiler does not do the same thing for function epoilogue. For example it might instead of using pop to PC use bx LR, but Why?

Well first off they likely want to keep the stack aligned on a 64 bit boundary.
R7 is better than anything greater for a frame pointer as registers r8 to r15 are not supported in most instructions. I would have to look I would assume there are special pc and sp offset load/store instructions so why would r7 be burned at all?
Not sure all you are asking, in thumb you can push lr but pop pc and I think that is equivalent to bx lr, but you have to look it up for each architecture as for some you cannot switch modes with pop. In this case it appears to assume that and not burn the extra instruction with a pop r3 bx r3 kind of thing. And actually to have done that would have likely needed to be two extra instructions pop r7, pop r3, bx r3.
So it may be a case that one compiler is told what architecture is being used and can assume pop pc is safe where another is not so sure. Again have to read the arm architecture docs for various architectures to know the variations on what instructions can be used to change modes and what cant. Perhaps if you walk through various architecture types with gnu it may change the way it returns.
EDIT
unsigned int morefun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int x, unsigned int y )
{
x+=1;
return(morefun(x,y+2)+3);
}
arm-none-eabi-gcc -O2 -mthumb -c so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: b510 push {r4, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bc10 pop {r4}
e: bc02 pop {r1}
10: 4708 bx r1
12: 46c0 nop ; (mov r8, r8)
arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m3 -march=armv7-m -c so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: b508 push {r3, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bd08 pop {r3, pc}
e: bf00 nop
just using that march without the mcpu gives the same result (doesnt pop the lr to r1 to bx).
march=armv5t changes it up slightly
00000000 <fun>:
0: b510 push {r4, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bd10 pop {r4, pc}
e: 46c0 nop ; (mov r8, r8)
armv4t as expected does the pop and bx thing.
armv6-m gives what armv5t gave.
gcc version 6.1.0 built using --target=arm-none-eabi without any other arm specifier.
So likely as the OP is asking if I understand right they are probably seeing the three instruction pop pop bx rather than a single pop {rx,pc}. Or at least one compiler varies compared to another. Apple IOS was mentioned so it likely defaults to a heavier duty core than a works everywhere type of thing. And their gcc like mine defaults to the work everywhere (including the original ARMv4T) rather than work everywhere but the original. I assume if you add some command line options you will see the gcc compiler behave differently as I have demonstrated.
Note in these examples r3 and r4 are not used, why are they preserving them then? It is likely the first thing I mentioned keeping a 64 bit alignment on the stack. If for the all thumb variants solution if you get an interrupt between the pops then the interrupt handler is dealing with an unaligned stack. Since r4 was throwaway anyway they could have popped r1 and r2 or r2 and r3 and then bx r2 or bx r3 respectively and not had that moment where it was unaligned and saved an instruction. Oh well...

ARM ASSEMBLY loop through argv arguments

I am trying to pass arguments to my program at the command line using argv, i figured out how to point to the first argv address, but i cannot seem to loop to the next one.
here is my code, but I think that what is relevant is in the first subroutine:
.text
.global _start
.equ exit, 1
.equ write, 4
.equ stdout, 1
_start:
ldr r5, [sp] #argc value
ldr r6, =1
mov r8, #8 #argv address
0: ldr r4, [sp, r8]
add r8, r8, #4
mov r1, r4
adr r10, isbn10
adr r11, valid
adr r12, invalid
adr r13, isbn13
bl strlen
cmp r0, #13
beq 1f
cmp r0, #10
beq 2f
1: bl check_13
cmp r2,#0
bleq print13v
blne print13i
add r6, r6, #1
cmp r5,r6
bne 0b
mov r0, #0 # success exit code
mov r7, #exit
svc 0
2: bl check_10
cmp r2,#0
bleq print10v
blne print10i
add r6, r6, #1
cmp r5,r6
bne 0b
mov r0, #0 # success exit code
mov r7, #exit
svc 0 # return to os
val:.asciz "9780306406157"
isbn10:.asciz "\nisbn-10 : "
isbn13:.asciz "\nisbn-13 : "
valid:.asciz ": valid"
invalid:.asciz ": invalid"
.align 2
strlen:
mov r0, #0
# length to return
0:
ldrb r2, [r1], #1 # get current char and advance
cmp r2, #0 # are we at the end of the string?
addne r0, #1
bne 0b
mov pc, lr
#######################
check_13: #sum at r2
mov r1, r4
mov r3,#1 #toggle
mov r2,#0 #sum
0:
ldrb r0,[r1], #1
cmp r0, #0
beq 9f
cmp r0, #'0
blo 1f
cmp r0, #'9
bhi 1f
sub r0,r0,#'0
add r2,r2,r0
cmp r2, #10
subge r2,r2,#10
eors r3,r3,#1 #toggled?
addne r2,r2,r0,lsl#1
cmp r2, #10
subge r2,r2,#10
cmp r2, #10
subge r2,r2,#10
bal 0b
1: mov r2, #22 #returns r2=22 if invalid
mov pc,lr
9: mov pc,lr
##################
check_10: #sum at r2
mov r1, r4
mov r3,#0 #t
mov r2,#0 #sum
0:
ldrb r0,[r1], #1
cmp r0, #0 #end?
beq 9f
cmp r0, #'0
blo 1f
cmp r0, #'9
bhi 2f
sub r0,r0,#'0
bal 3f
3: add r3,r3,r0
cmp r3, #11
subge r3, r3, #11
add r2,r2,r3
cmp r2, #11
subge r2, r2, #11
bal 0b
2: and r0,r0, #0xdf # x becomes x
cmp r0, #0x58 # x?
bne 1f
mov r0,#10
bal 3b
1: mov r2, #22 #returns r2=22 if invalid
mov pc,lr
9: mov pc,lr
######################
invalid:
mov r2, #22 #returns r2=22 if invalid
mov pc,lr
#######################
print10v:
mov r9,lr
mov r1,r10
bl strlen
mov r2,r0
mov r1,r10
mov r0,#stdout
mov r7, #write #herehere
svc 0
mov r1,r4
bl strlen
mov r1,r4
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov r1,r11
bl strlen
mov r1,r11
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov pc,r9
#######################
print10i:
mov r9,lr
mov r1,r10
bl strlen
mov r2,r0
mov r1,r10
mov r0,#stdout
mov r7, #write #herehere
svc 0
mov r1,r4
bl strlen
mov r1,r4
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov r1,r12
bl strlen
mov r1,r12
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov pc,r9
#######################
print13v:
mov r9,lr
mov r1,r13
bl strlen
mov r2,r0
mov r1,r13
mov r0,#stdout
mov r7, #write #herehere
svc 0
mov r1,r4
bl strlen
mov r1,r4
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov r1,r11
bl strlen
mov r1,r11
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov pc,r9
#######################
print13i:
mov r9,lr
mov r1,r13
bl strlen
mov r2,r0
mov r1,r13
mov r0,#stdout
mov r7, #write #herehere
svc 0
mov r1,r4
bl strlen
mov r1,r4
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov r1,r12
bl strlen
mov r1,r12
mov r2,r0
mov r0,#stdout
mov r7, #write
svc 0
mov pc,r9
after i assemble and link it
i run it using ./validate 9780306406157 1234567890
ISBN-13 : 9780306406157: VALID
ISBN-13 : 306406157: INVALID[Inferior 1 (process 22221) exited normally]
meaning that r4 at the second time through the loop got 306406157, i wanted it to get 1234567890...
after doing the suggested editing i ran the program and it gave me a segmentation on line 60, when i try to read a byte (a char) from the new argument, i ran gdb and i noticed that the value of r4 (supposed to be argv[2] on the second time through the loop) is very far from the value in the first time through the loop
14 mov r1, r4
(gdb) i r
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0xbefff8d6 3204446422
r5 0x3 3
r6 0x1 1
r7 0x0 0
r8 0xc 12
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0xbefff790 0xbefff790
lr 0x0 0
pc 0x8068 0x8068 <_start+20>
cpsr 0x10 16
(gdb) c
Continuing.
isbn-13 : 9780306406157: valid
Breakpoint 1, _start () at validate.s:12
12 0: ldr r4, [sp, r8]
(gdb) stepi
13 add r8, r8, #4
(gdb)
14 mov r1, r4
(gdb) i r
r0 0x7 7
r1 0x8106 33030
r2 0x7 7
r3 0x0 0
r4 0x6176203a 1635131450
r5 0x3 3
r6 0x2 2
r7 0x4 4
r8 0x10 16
r9 0x809c 32924
r10 0x80ee 33006
r11 0x8106 33030
r12 0x81f4 33268
sp 0x80fa 0x80fa
lr 0x82f8 33528
pc 0x8068 0x8068 <_start+20>
cpsr 0x20000010 536870928
any help?

What you get when you do ldr r4, [sp, #8] is argv[1] (argv[0] which is at [sp, #4] is the name of the executing program).
So addne r4, r4, #4 will just move 4 bytes ahead within argv[1]. What you should do to load argv[2], argv[3], etc., is to read from [sp, #0xC], [sp, #0x10], etc.
Something like this:
mov r8, #8 # Offset of argv[1]
0: ldr r4, [sp, r8] r4 = argv[n]
add r8, r8, #4 n++
mov r1, r4

ARM-C Inter-working

I am trying out a simple program for ARM-C inter-working. Here is the code:
#include<stdio.h>
#include<stdlib.h>
int Double(int a);
extern int Start(void);
int main(){
int result=0;
printf("in C main\n");
result=Start();
printf("result=%d\n",result);
return 0;
}
int Double(int a)
{
printf("inside double func_argument_value=%d\n",a);
return (a*2);
}
The assembly file goes as-
.syntax unified
.cpu cortex-m3
.thumb
.align
.global Start
.global Double
.thumb_func
Start:
mov r10,lr
mov r0,#42
bl Double
mov lr,r10
mov r2,r0
mov pc,lr
During debugging on LPC1769(embedded artists board), I get an hardfault error on the instruction " result=Start(). " I am trying to do an arm-C internetworking here. the lr value during the execution of the above the statement(result=Start()) is 0x0000029F, where the faulting instruction is,and the pc value is 0x0000029E.
This is how I got the faulting instruction in r1
__asm("mrs r0,MSP\n"
"isb\n"
"ldr r1,[r0,#24]\n");
Can anybody please explain where I am going wrong? Any solution is appreciated.
Thank you in advance.
I am a beginner in cortex-m3 & am using the NXP LPCXpresso IDE powered by Code_Red.
Here is the disassembly of my code.
IntDefaultHandler:
00000269: push {r7}
0000026b: add r7, sp, #0
0000026d: b.n 0x26c <IntDefaultHandler+4>
0000026f: nop
00000271: mov r3, lr
00000273: mov.w r0, #42 ; 0x2a
00000277: bl 0x2c0 <Double>
0000027b: mov lr, r3
0000027d: mov r2, r0
0000027f: mov pc, lr
main:
00000281: push {r7, lr}
00000283: sub sp, #8
00000285: add r7, sp, #0
00000287: mov.w r3, #0
0000028b: str r3, [r7, #4]
0000028d: movw r3, #11212 ; 0x2bcc
00000291: movt r3, #0
00000295: mov r0, r3
00000297: bl 0xd64 <printf>
0000029b: bl 0x270 <Start>
0000029f: mov r3, r0
000002a1: str r3, [r7, #4]
000002a3: movw r3, #11224 ; 0x2bd8
000002a7: movt r3, #0
000002ab: mov r0, r3
000002ad: ldr r1, [r7, #4]
000002af: bl 0xd64 <printf>
000002b3: mov.w r3, #0
000002b7: mov r0, r3
000002b9: add.w r7, r7, #8
000002bd: mov sp, r7
000002bf: pop {r7, pc}
Double:
000002c0: push {r7, lr}
000002c2: sub sp, #8
000002c4: add r7, sp, #0
000002c6: str r0, [r7, #4]
000002c8: movw r3, #11236 ; 0x2be4
000002cc: movt r3, #0
000002d0: mov r0, r3
000002d2: ldr r1, [r7, #4]
000002d4: bl 0xd64 <printf>
000002d8: ldr r3, [r7, #4]
000002da: mov.w r3, r3, lsl #1
000002de: mov r0, r3
000002e0: add.w r7, r7, #8
000002e4: mov sp, r7
000002e6: pop {r7, pc}
As per your advice Dwelch, I have changed the r10 to r3.

I assume you mean interworking not internetworking? The LPC1769 is a cortex-m3 which is thumb/thumb2 only so it doesnt support arm instructions so there is no interworking available for that platform. Nevertheless, playing with the compiler to see what goes on:
Get the compiler to do it for you first, then try it yourself in asm...
start.s
.thumb
.globl _start
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
two.c
unsigned int two ( unsigned int t )
{
return(t+5);
}
Makefile
hello.list : start.s hello.c two.c
arm-none-eabi-as -mthumb start.s -o start.o
arm-none-eabi-gcc -c -O2 hello.c -o hello.o
arm-none-eabi-gcc -c -O2 -mthumb two.c -o two.o
arm-none-eabi-ld -Ttext=0x1000 start.o hello.o two.o -o hello.elf
arm-none-eabi-objdump -D hello.elf > hello.list
clean :
rm -f *.o
rm -f *.elf
rm -f *.list
produces hello.list
Disassembly of section .text:
00001000 <_start>:
1000: 4801 ldr r0, [pc, #4] ; (1008 <hang+0x2>)
1002: 46fe mov lr, pc
1004: 4700 bx r0
00001006 <hang>:
1006: e7fe b.n 1006 <hang>
1008: 0000100c andeq r1, r0, ip
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
...
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
hello.o disassembled by itself:
00000000 <hello>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <two>
8: e8bd4008 pop {r3, lr}
c: e2800007 add r0, r0, #7
10: e12fff1e bx lr
the compiler uses bl assuming/hoping it will be calling arm from arm. but it didnt, so what they did was put a trampoline in there.
0000100c <hello>:
100c: e92d4008 push {r3, lr}
1010: eb000004 bl 1028 <__two_from_arm>
1014: e8bd4008 pop {r3, lr}
1018: e2800007 add r0, r0, #7
101c: e12fff1e bx lr
00001028 <__two_from_arm>:
1028: e59fc000 ldr ip, [pc] ; 1030 <__two_from_arm+0x8>
102c: e12fff1c bx ip
1030: 00001021 andeq r1, r0, r1, lsr #32
1034: 00000000 andeq r0, r0, r0
the bl to __two_from_arm is an arm mode to arm mode branch link. the address of the destination function (two) with the lsbit set, which tells bx to switch to thumb mode, is loaded into the disposable register ip (r12?) then the bx ip happens switching modes. the branch link had setup the return address in lr, which was an arm mode address no doubt (lsbit zero).
00001020 <two>:
1020: 3005 adds r0, #5
1022: 4770 bx lr
1024: 0000 movs r0, r0
the two() function does its thing and returns, note you have to use bx lr not mov pc,lr when interworking. Basically if you are not running an ARMv4 without the T, or an ARMv5 without the T, mov pc,lr is an okay habit. But anything ARMv4T or newer (ARMv5T or newer) use bx lr to return from a function unless you have a special reason not to. (avoid using pop {pc} as well for the same reason unless you really need to save that instruction and are not interworking). Now being on a cortex-m3 which is thumb+thumb2 only, well you cant interwork so you can use mov pc,lr and pop {pc}, but the code is not portable, and it is not a good habit as that habit will bite you when you switch back to arm programming.
So since hello was in arm mode when it used bl which is what set the link register, the bx in two_from_arm does not touch the link register, so when two() returns with a bx lr it is returning to arm mode after the bl __two_from_arm line in the hello() function.
Also note the extra 0x0000 after the thumb function, this was to align the program on a word boundary so that the following arm code was aligned...
to see how the compiler does thumb to arm change two as follows
unsigned int three ( unsigned int );
unsigned int two ( unsigned int t )
{
return(three(t)+5);
}
and put that function in hello.c
extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
return(two(h)+7);
}
unsigned int three ( unsigned int t )
{
return(t+3);
}
and now we get another trampoline
00001028 <two>:
1028: b508 push {r3, lr}
102a: f000 f80b bl 1044 <__three_from_thumb>
102e: 3005 adds r0, #5
1030: bc08 pop {r3}
1032: bc02 pop {r1}
1034: 4708 bx r1
1036: 46c0 nop ; (mov r8, r8)
...
00001044 <__three_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eafffff4 b 1020 <three>
104c: 00000000 andeq r0, r0, r0
Now this is a very cool trampoline. the bl to three_from_thumb is in thumb mode and the link register is set to return to the two() function with the lsbit set no doubt to indicate to return to thumb mode.
The trampoline starts with a bx pc, pc is set to two instructions ahead and the pc internally always has the lsbit clear so a bx pc will always take you to arm mode if not already in arm mode, and in either mode two instructions ahead. Two instructions ahead of the bx pc is an arm instruction that branches (not branch link!) to the three function, completing the trampoline.
Notice how I wrote the call to hello() in the first place
_start:
ldr r0,=hello
mov lr,pc
bx r0
hang : b hang
that actually wont work will it? It will get you from arm to thumb but not from thumb to arm. I will leave that as an exercise for the reader.
If you change start.s to this
.thumb
.globl _start
_start:
bl hello
hang : b hang
the linker takes care of us:
00001000 <_start>:
1000: f000 f820 bl 1044 <__hello_from_thumb>
00001004 <hang>:
1004: e7fe b.n 1004 <hang>
...
00001044 <__hello_from_thumb>:
1044: 4778 bx pc
1046: 46c0 nop ; (mov r8, r8)
1048: eaffffee b 1008 <hello>
I would and do always disassemble programs like these to make sure the compiler and linker resolved these issues. Also note that for example __hello_from_thumb can be used from any thumb function, if I call hello from several places, some arm, some thumb, and hello was compiled for arm, then the arm calls would call hello directly (if they can reach it) and all the thumb calls would share the same hello_from_thumb (if they can reach it).
The compiler in these examples was assuming code that stays in the same mode (simple branch link) and the linker added the interworking code...
If you really meant inter-networking and not interworking, then please describe what that is and I will delete this answer.
EDIT:
You were using a register to preserve lr during the call to Double, that will not work, no register will work for that you need to use memory, and the easiest is the stack. See how the compiler does it:
00001008 <hello>:
1008: e92d4008 push {r3, lr}
100c: eb000009 bl 1038 <__two_from_arm>
1010: e8bd4008 pop {r3, lr}
1014: e2800007 add r0, r0, #7
1018: e12fff1e bx lr
r3 is pushed likely to align the stack on a 64 bit boundary (makes it faster). the thing to notice is the link register is preserved on the stack, but the pop does not pop to pc because this is not an ARMv4 build, so a bx is needed to return from the function. Because this is arm mode we can pop to lr and simply bx lr.
For thumb you can only push r0-r7 and lr directly and pop r0-r7 and pc directly you dont want to pop to pc because that only works if you are staying in the same mode (thumb or arm). this is fine for a cortex-m, or fine if you know what all of your callers are, but in general bad. So
00001024 <two>:
1024: b508 push {r3, lr}
1026: f000 f811 bl 104c <__three_from_thumb>
102a: 3005 adds r0, #5
102c: bc08 pop {r3}
102e: bc02 pop {r1}
1030: 4708 bx r1
same deal r3 is used as a dummy register to keep the stack aligned for performance (I used the default build for gcc 4.8.0 which is likely a platform with a 64 bit axi bus, specifying the architecture might remove that extra register). Because we cannot pop pc, I assume because r1 and r3 would be out of order and r3 was chosen (they could have chosen r2 and saved an instruction) there are two pops, one to get rid of the dummy value on the stack and the other to put the return value in a register so that they can bx to it to return.
Your Start function does not conform to the ABI and as a result when you mix it in with such large libraries as a printf call, no doubt you will crash. If you didnt it was dumb luck. Your assembly listing of main shows that neither r4 nor r10 were used and assuming main() is not called other than the bootstrap, then that is why you got away with either r4 or r10.
If this really is an LPC1769 this this whole discussion is irrelevant as it does not support ARM and does not support interworking (interworking = mixing of ARM mode code and thumb mode code). Your problem was unrelated to interworking, you are not interworking (note the pop {pc} at the end of the functions). Your problem was likely related to your assembly code.
EDIT2:
Changing the makefile to specify the cortex-m
00001008 <hello>:
1008: b508 push {r3, lr}
100a: f000 f805 bl 1018 <two>
100e: 3007 adds r0, #7
1010: bd08 pop {r3, pc}
1012: 46c0 nop ; (mov r8, r8)
00001014 <three>:
1014: 3003 adds r0, #3
1016: 4770 bx lr
00001018 <two>:
1018: b508 push {r3, lr}
101a: f7ff fffb bl 1014 <three>
101e: 3005 adds r0, #5
1020: bd08 pop {r3, pc}
1022: 46c0 nop ; (mov r8, r8)
first and foremost it is all thumb since there is no arm mode on a cortex-m, second the bx is not needed for function returns (Because there are no arm/thumb mode changes). So pop {pc} will work.
it is curious that the dummy register is still used on a push, I tried an arm7tdmi/armv4t build and it still did that, so there is some other flag to use to get rid of that behavior.
If your desire was to learn how to make an assembly function that you can call from C, you should have just done that. Make a C function that somewhat resembles the framework of the function you want to create in asm:
extern unsigned int Double ( unsigned int );
unsigned int Start ( void )
{
return(Double(42));
}
assemble then disassemble
00000000 <Start>:
0: b508 push {r3, lr}
2: 202a movs r0, #42 ; 0x2a
4: f7ff fffe bl 0 <Double>
8: bd08 pop {r3, pc}
a: 46c0 nop ; (mov r8, r8)
and start with that as you assembly function.
.globl Start
.thumb_func
Start:
push {lr}
mov r0, #42
bl Double
pop {pc}
That, or read the arm abi for gcc and understand what registers you can and cant use without saving them on the stack, what registers are used for passing and returning parameters.

_start function is the entry point of a C program which makes a call to main(). In order to debug _start function of any C program is in the assembly file. Actually the real entry point of a program on linux is not main(), but rather a function called _start(). The standard libraries normally provide a version of this that runs some initialization code, then calls main().
Try compiling this with gcc -nostdlib: