Objdump stops disassembling after label - arm

I'm designing an AArch64 application in assembly and C using the Linaro toolchain, and I frequently use objdump to inspect my own disassembled binary.
However, objdump does not decode part of the file and treats it as data instead. It always happens after the second label in the source.
For example, this code:
.global _Reset
_Reset:
BL get_cpuid
CBNZ x0, inf_loop
LDR x0, =page_table_base
LDR x1, =0x0000000000000601
STR x1, [x0, #0x00]
test:
LDR x1, =0x0060000040000601
STR x1, [x0, #0x08]
...
Disassembles to
Disassembly of section .startup:
0000000000000000 <_Reset>:
0: 94000024 bl 90 <get_cpuid>
4: b50004c0 cbnz x0, 9c <inf_loop>
8: 58000880 ldr x0, 118 <TXTN+0x3>
c: 580008a1 ldr x1, 120 <TXTN+0xb>
10: f9000001 str x1, [x0]
0000000000000014 <test>:
14: 580008a1 .word 0x580008a1
18: f9000401 .word 0xf9000401
...
Why does this happen?

Related

How can I generate following arm assembler output using ARM gcc 7.3?

myfunction:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
mul r3, r0, r0
mov r0, r3
mla r0, r1, r0, r2
bx lr
I am able to generate everything except for the mov instruction using the following C function.
int myfunction(int r0, int r1, int r2, int r3)
{
r3 = r0*r0;
r0 = r3;
r3 = r0;
return (r1*r3)+r2;
}
How can I instruct r3 to be set to the address of r0 in assembly code?
unsigned int myfunction(unsigned int a, unsigned int b, unsigned int c)
{
return (a*a*b)+c;
}
Your choices are going to be something like this
00000000 <myfunction>:
0: e52db004 push {r11} ; (str r11, [sp, #-4]!)
4: e28db000 add r11, sp, #0
8: e24dd014 sub sp, sp, #20
c: e50b0008 str r0, [r11, #-8]
10: e50b100c str r1, [r11, #-12]
14: e50b2010 str r2, [r11, #-16]
18: e51b3008 ldr r3, [r11, #-8]
1c: e51b2008 ldr r2, [r11, #-8]
20: e0010392 mul r1, r2, r3
24: e51b200c ldr r2, [r11, #-12]
28: e0000291 mul r0, r1, r2
2c: e51b3010 ldr r3, [r11, #-16]
30: e0803003 add r3, r0, r3
34: e1a00003 mov r0, r3
38: e28bd000 add sp, r11, #0
3c: e49db004 pop {r11} ; (ldr r11, [sp], #4)
40: e12fff1e bx lr
or this
00000000 <myfunction>:
0: e0030090 mul r3, r0, r0
4: e0202391 mla r0, r1, r3, r2
8: e12fff1e bx lr
as you have probably figured out.
The compiler backend should never consider the mov, as it just wastes an instruction: r3 can go straight into the mla, so there is no need to copy it into r0 first. I'm not quite sure how to get the compiler to do more. Even this doesn't encourage it:
unsigned int fun ( unsigned int a )
{
return(a*a);
}
unsigned int myfunction(unsigned int a, unsigned int b, unsigned int c)
{
return (fun(a)*b)+c;
}
giving
00000000 <fun>:
0: e1a03000 mov r3, r0
4: e0000093 mul r0, r3, r0
8: e12fff1e bx lr
0000000c <myfunction>:
c: e0030090 mul r3, r0, r0
10: e0202391 mla r0, r1, r3, r2
14: e12fff1e bx lr
Basically, if you don't optimize you get nowhere near what you were after. If you do optimize, that mov shouldn't be there; it should be easy to optimize out.
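To see why the mov is redundant, here is a small host-side model of the two instruction sequences. It is only a sketch: the regs_t type and the mul/mla/mov helpers are hypothetical names made up for illustration, and only r0-r3 are modeled with 32-bit wraparound arithmetic.

```c
#include <stdint.h>

/* Hypothetical register file: only r0-r3 are modeled. */
typedef struct { uint32_t r[4]; } regs_t;

/* mul rd, rm, rs : rd = rm * rs (low 32 bits) */
static void mul(regs_t *c, int rd, int rm, int rs)
{
    c->r[rd] = c->r[rm] * c->r[rs];
}

/* mla rd, rm, rs, rn : rd = rm * rs + rn (low 32 bits) */
static void mla(regs_t *c, int rd, int rm, int rs, int rn)
{
    c->r[rd] = c->r[rm] * c->r[rs] + c->r[rn];
}

/* mov rd, rm : rd = rm */
static void mov(regs_t *c, int rd, int rm)
{
    c->r[rd] = c->r[rm];
}

/* The sequence the question asks for: mul, mov, mla. */
uint32_t with_mov(uint32_t a, uint32_t b, uint32_t addend)
{
    regs_t c = {{a, b, addend, 0}};
    mul(&c, 3, 0, 0);     /* mul r3, r0, r0     */
    mov(&c, 0, 3);        /* mov r0, r3         */
    mla(&c, 0, 1, 0, 2);  /* mla r0, r1, r0, r2 */
    return c.r[0];
}

/* The sequence the optimizer emits: mul, mla. */
uint32_t without_mov(uint32_t a, uint32_t b, uint32_t addend)
{
    regs_t c = {{a, b, addend, 0}};
    mul(&c, 3, 0, 0);     /* mul r3, r0, r0     */
    mla(&c, 0, 1, 3, 2);  /* mla r0, r1, r3, r2 */
    return c.r[0];
}
```

Both functions return (a*a*b)+c modulo 2^32 for any inputs, which is why an optimizing backend has no reason to keep the extra copy.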
While some level of manipulation of writing high level code to encourage the compiler to output low level code is possible, trying to get this exact output is not something you should expect to be able to do.
Unless you use inline asm
asm
(
"mul r3, r0, r0\n"
"mov r0, r3\n"
"mla r0, r1, r0, r2\n"
"bx lr\n"
);
giving your result
Disassembly of section .text:
00000000 <.text>:
0: e0030090 mul r3, r0, r0
4: e1a00003 mov r0, r3
8: e0202091 mla r0, r1, r0, r2
c: e12fff1e bx lr
or real asm
mul r3, r0, r0
mov r0, r3
mla r0, r1, r0, r2
bx lr
and feed it to gcc rather than as (arm-whatever-gcc -c so.s -o so.o)
Disassembly of section .text:
00000000 <.text>:
0: e0030090 mul r3, r0, r0
4: e1a00003 mov r0, r3
8: e0202091 mla r0, r1, r0, r2
c: e12fff1e bx lr
so that technically you were using gcc on the command line but gcc does some preprocessing and then feeds it to as.
Unless you find a core or bug where Rd and Rs have to be the same register, and can then specify that core/bug/whatever on the gcc command line, I don't see the mov happening. Maybe, just maybe, with clang/llvm you could compile fun and myfunction separately to bitcode, combine them, optimize, output to the target, and then examine that. I would hope that either the optimization or the output stage would remove the mov, but you might get lucky.
Edit
I made an error:
unsigned int myfunction(unsigned int a, unsigned int b, unsigned int c)
{
return (a*a*b)+c;
}
arm-linux-gnueabi-gcc --version
arm-linux-gnueabi-gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Disassembly of section .text:
00000000 <myfunction>:
0: e0030090 mul r3, r0, r0
4: e1a00003 mov r0, r3
8: e0202091 mla r0, r1, r0, r2
c: e12fff1e bx lr
but this
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <myfunction>:
0: e0030090 mul r3, r0, r0
4: e0202391 mla r0, r1, r3, r2
8: e12fff1e bx lr
I'll have to build a 7.3 or go find one. Somewhere between 5.x.x and 8.x.x the backend changed or...
Note you may need -mcpu=arm7tdmi or -mcpu=arm9tdmi or -march=armv4t or -march=armv5t on the command line depending on the default target (cpu/arch) built into your compiler. Or you might get something like this
Disassembly of section .text:
00000000 <myfunction>:
0: fb00 f000 mul.w r0, r0, r0
4: fb01 2000 mla r0, r1, r0, r2
8: 4770 bx lr
a: bf00 nop
this
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
produces
Disassembly of section .text:
00000000 <myfunction>:
0: e0030090 mul r3, r0, r0
4: e0202391 mla r0, r1, r3, r2
8: e12fff1e bx lr
So you may have to work backward to find the version where it changed and the gcc source code change that caused it, then modify 7.3.0, making something that is not really 7.3.0 but reports as 7.3.0 and outputs your desired code.

Why do I get different result using log() function in C?

Here is a simple example of log() function test:
#include <stdio.h>
#include <math.h>
int main(void)
{
int a = 2;
printf("int a = %d, log((double)a) = %g, log(2.0) = %g\n", a, log((double)a), log(2.0));
return 0;
}
I get different results on a Raspberry Pi 3 and on Ubuntu 16.04:
arm-linux-gnueabi-gcc
$ arm-linux-gnueabi-gcc -mfloat-abi=soft -march=armv7-a foo.c -o foo -lm
$ ./foo
int a = 2, log((double)a) = 5.23028e-314, log(2.0) = 0.693147
arm-linux-gnueabihf-gcc
$ arm-linux-gnueabihf-gcc -march=armv7-a foo.c -o foo -lm
$ ./foo
int a = 2, log((double)a) = 0.693147, log(2.0) = 0.693147
gcc
$ gcc foo.c -o foo -lm
$ ./foo
int a = 2, log((double)a) = 0.693147, log(2.0) = 0.693147
The standard distribution of Raspbian uses the hardware floating point support of the Raspberry Pi (Raspbian FAQ) which is not fully compatible with the different approach of using a software library to emulate floating point computation using integers only.
You can tell the type of your Raspbian distribution by looking for the directory /lib/arm-linux-gnueabihf for the hard-float version and /lib/arm-linux-gnueabi (How can I tell...) for the soft-float one.
As Pascal Cuoq noted in one of the comments to this question, it might be of interest to know that the reason for the correct result of log(2.0) in all examples is called constant folding. The compiler is allowed to compute results at compile time, where possible, for optimization purposes. This might be unwanted behaviour if, for example, your code uses different rounding modes. GCC has -frounding-math to switch off constant folding (among other things), although it might not catch everything, so be careful here.
Not able to repeat the issue. Where is your disassembly to show the value fed to printf?
#include <math.h>
double fun1 ( void )
{
return(log(2));
}
double fun2 ( void )
{
return(log(2.0));
}
00000000 <fun1>:
0: e30309ef movw r0, #14831 ; 0x39ef
4: e3021e42 movw r1, #11842 ; 0x2e42
8: e34f0efa movt r0, #65274 ; 0xfefa
c: e3431fe6 movt r1, #16358 ; 0x3fe6
10: e12fff1e bx lr
00000014 <fun2>:
14: e30309ef movw r0, #14831 ; 0x39ef
18: e3021e42 movw r1, #11842 ; 0x2e42
1c: e34f0efa movt r0, #65274 ; 0xfefa
20: e3431fe6 movt r1, #16358 ; 0x3fe6
24: e12fff1e bx lr
00000000 <fun1>:
0: ed9f 0b01 vldr d0, [pc, #4] ; 8 <fun1+0x8>
4: 4770 bx lr
6: bf00
8: fefa39ef
c: 3fe62e42
00000010 <fun2>:
10: ed9f 0b01 vldr d0, [pc, #4] ; 18 <fun2+0x8>
14: 4770 bx lr
16: bf00
18: fefa39ef
1c: 3fe62e42
0000000000000000 <fun1>:
0: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 8 <fun1+0x8>
7: 00
8: c3 retq
0000000000000010 <fun2>:
10: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 18 <fun2+0x8>
17: 00
18: c3 retq
0000000000000000 <.LC0>:
0: ef
1: 39 fa
3: fe 42 2e
6: e6 3f
Now you are comparing an int-to-float conversion against building in the float version directly, (2) vs (2.0), and you could add (2.0F) as well. Doing the conversion at compile time or at run time can cause differences.
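The constant in the listings above can be checked by hand. The following is a minimal sketch, assuming a host with IEEE 754 doubles and the little-endian soft-float convention where r0 holds the low word and r1 the high word; the helper names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Build a 32-bit register value the way a movw/movt pair does:
 * movw sets the low 16 bits, movt sets the high 16 bits. */
uint32_t movw_movt(uint16_t low16, uint16_t high16)
{
    return ((uint32_t)high16 << 16) | low16;
}

/* Join the two 32-bit halves of a double (lo = r0, hi = r1). */
double double_from_words(uint32_t lo, uint32_t hi)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);  /* reinterpret the bits without aliasing issues */
    return d;
}

/* The value fun1 and fun2 return in the soft-float listing. */
double folded_constant(void)
{
    uint32_t r0 = movw_movt(0x39ef, 0xfefa);  /* 0xfefa39ef */
    uint32_t r1 = movw_movt(0x2e42, 0x3fe6);  /* 0x3fe62e42 */
    return double_from_words(r0, r1);
}
```

The bit pattern 0x3fe62e42fefa39ef is the double closest to ln 2, so both log(2) and log(2.0) were replaced by the same precomputed 0.693147... in these builds.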
Start by eliminating the printf and divide the problem in half: am I seeing a printf thing or not a printf thing? Then: is this a compile-time thing or a run-time thing? Is this a hard-float thing or a soft-float thing? Is this a C library thing or not a C library thing?
What if anything have you done so far to debug this?
Eventually someone is going to link the "whatever programmer should know about floating point" whether it applies or not...
EDIT
#include <math.h>
double fun ( void )
{
return(log(2.0));
}
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: ec432b17 vmov d7, r2, r3
1c: eeb00b47 vmov.f64 d0, d7
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: e1a00002 mov r0, r2
1c: e1a01003 mov r1, r3
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
Well, there goes the notion of constant folding explaining why two calls to log() give vastly different results. (Arguably, with a different version of the toolchain or different command line arguments you could just get lucky; so far we don't know what toolchain versions, build options, etc. were used, so we cannot repeat this.)
EDIT 2
#include <math.h>
double fun ( void )
{
return(log(2));
}
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: ec432b17 vmov d7, r2, r3
1c: eeb00b47 vmov.f64 d0, d7
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: e1a00002 mov r0, r2
1c: e1a01003 mov r1, r3
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
That is around 60 seconds' worth of work to contemplate constant folding maybe being a factor. So far it doesn't apply; there is potential dumb luck there, but the same dumb luck could/would apply to both calls to log.
A few seconds of work by the OP to disassemble that program would quickly cover this side topic.

Different Static Global Variables Share the Same Memory Address

Summary
I have several C source files that all declare individual identically named static global variables. My understanding is that the static global variable in each file should be visible only within that file and should not have external linkage applied, but in fact I can see when debugging that the identically named variables share the same memory address.
It is like the static keyword is being ignored and the global variables are being treated as extern instead. Why is this?
Example Code
foo.c:
/* Private variables -----------------------------------*/
static myEnumType myVar = VALUE_A;
/* Exported functions ----------------------------------*/
void someFooFunc(void) {
myVar = VALUE_B;
}
bar.c:
/* Private variables -----------------------------------*/
static myEnumType myVar = VALUE_A;
/* Exported functions ----------------------------------*/
void someBarFunc(void) {
myVar = VALUE_C;
}
baz.c:
/* Private variables -----------------------------------*/
static myEnumType myVar = VALUE_A;
/* Exported functions ----------------------------------*/
void someBazFunc(void) {
myVar = VALUE_D;
}
Debugging Observations
Set breakpoints on the myVar = ... line inside each function.
Call someFooFunc, someBarFunc, and someBazFunc in that order from main.
Inside someFooFunc myVar initially is set to VALUE_A, after stepping over the line it is set to VALUE_B.
Inside someBarFunc myVar is for some reason initially set to VALUE_B before stepping over the line, not VALUE_A as I'd expect, indicating the linker may have merged the separate global variables based on them having an identical name.
The same goes for someBazFunc when it is called.
If I use the debugger to evaluate the value of &myVar when at each breakpoint the same address is given.
Tools & Flags
Toolchain: GNU ARM GCC (6.2 2016q4)
Compiler options:
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mlong-calls -O1 -fmessage-length=0 -fsigned-char -ffunction-sections -fdata-sections -ffreestanding -fno-move-loop-invariants -Wall -Wextra -g3 -DDEBUG -DTRACE -DOS_USE_TRACE_ITM -DSTM32L476xx -I"../include" -I"../system/include" -I"../system/include/cmsis" -I"../system/include/stm32l4xx" -I"../system/include/cmsis/device" -I"../foo/inc" -std=gnu11 -MMD -MP -MF"foo/src/foo.d" -MT"foo/src/foo.o" -c -o "foo/src/foo.o" "../foo/src/foo.c"
Linker options:
arm-none-eabi-g++ -mcpu=cortex-m4 -mthumb -mlong-calls -O1 -fmessage-length=0 -fsigned-char -ffunction-sections -fdata-sections -ffreestanding -fno-move-loop-invariants -Wall -Wextra -g3 -T mem.ld -T libs.ld -T sections.ld -nostartfiles -Xlinker --gc-sections -L"../ldscripts" -Wl,-Map,"myProj.map" --specs=nano.specs -o ...
NOTE: I do understand that the OP's target platform is ARM, but I'm nevertheless posting an answer in terms of x86. The reason is that I have no ARM backend handy, and the question is not limited to a particular architecture.
Here's a simple test stand. Note that I'm using int instead of custom enum typedef, since it should not matter at all.
foo.c
static int myVar = 1;
int someFooFunc(void)
{
myVar += 2;
return myVar;
}
bar.c
static int myVar = 1;
int someBarFunc(void)
{
myVar += 3;
return myVar;
}
main.c
#include <stdio.h>
int someFooFunc(void);
int someBarFunc(void);
int main(int argc, char* argv[])
{
printf("%d\n", someFooFunc());
printf("%d\n", someBarFunc());
return 0;
}
I'm compiling it on x86_64 Ubuntu 14.04 with GCC 4.8.4:
$ g++ main.c foo.c bar.c
$ ./a.out
3
4
Obtaining such results effectively means that myVar variables in foo.c and bar.c are different. If you look at the disassembly (by objdump -D ./a.out):
000000000040052d <_Z11someFooFuncv>:
40052d: 55 push %rbp
40052e: 48 89 e5 mov %rsp,%rbp
400531: 8b 05 09 0b 20 00 mov 0x200b09(%rip),%eax # 601040 <_ZL5myVar>
400537: 83 c0 02 add $0x2,%eax
40053a: 89 05 00 0b 20 00 mov %eax,0x200b00(%rip) # 601040 <_ZL5myVar>
400540: 8b 05 fa 0a 20 00 mov 0x200afa(%rip),%eax # 601040 <_ZL5myVar>
400546: 5d pop %rbp
400547: c3 retq
0000000000400548 <_Z11someBarFuncv>:
400548: 55 push %rbp
400549: 48 89 e5 mov %rsp,%rbp
40054c: 8b 05 f2 0a 20 00 mov 0x200af2(%rip),%eax # 601044 <_ZL5myVar>
400552: 83 c0 03 add $0x3,%eax
400555: 89 05 e9 0a 20 00 mov %eax,0x200ae9(%rip) # 601044 <_ZL5myVar>
40055b: 8b 05 e3 0a 20 00 mov 0x200ae3(%rip),%eax # 601044 <_ZL5myVar>
400561: 5d pop %rbp
400562: c3 retq
You can see that the actual addresses of static variables in different modules are indeed different: 0x601040 for foo.c and 0x601044 for bar.c. However, they are associated with a single symbol _ZL5myVar, which really screws up GDB logic.
You can double-check that by means of objdump -t ./a.out:
0000000000601040 l O .data 0000000000000004 _ZL5myVar
0000000000601044 l O .data 0000000000000004 _ZL5myVar
Yet again, different addresses, same symbols. How GDB will resolve this conflict is purely implementation-dependent.
I strongly believe that it's your case as well. However, to be double sure, you might want to try these steps in your environment.
so.s make the linker happy
.globl _start
_start: b _start
one.c
static unsigned int hello = 4;
static unsigned int one = 5;
void fun1 ( void )
{
hello=5;
one=6;
}
two.c
static unsigned int hello = 4;
static unsigned int two = 5;
void fun2 ( void )
{
hello=5;
two=6;
}
three.c
static unsigned int hello = 4;
static unsigned int three = 5;
void fun3 ( void )
{
hello=5;
three=6;
}
First off, if you optimize, then this is completely dead code and you should not expect to see any of these variables. The functions are not static, so they don't disappear:
Disassembly of section .text:
08000000 <_start>:
8000000: eafffffe b 8000000 <_start>
08000004 <fun1>:
8000004: e12fff1e bx lr
08000008 <fun2>:
8000008: e12fff1e bx lr
0800000c <fun3>:
800000c: e12fff1e bx lr
If you don't optimize, then:
08000000 <_start>:
8000000: eafffffe b 8000000 <_start>
08000004 <fun1>:
8000004: e52db004 push {r11} ; (str r11, [sp, #-4]!)
8000008: e28db000 add r11, sp, #0
800000c: e59f3020 ldr r3, [pc, #32] ; 8000034 <fun1+0x30>
8000010: e3a02005 mov r2, #5
8000014: e5832000 str r2, [r3]
8000018: e59f3018 ldr r3, [pc, #24] ; 8000038 <fun1+0x34>
800001c: e3a02006 mov r2, #6
8000020: e5832000 str r2, [r3]
8000024: e1a00000 nop ; (mov r0, r0)
8000028: e28bd000 add sp, r11, #0
800002c: e49db004 pop {r11} ; (ldr r11, [sp], #4)
8000030: e12fff1e bx lr
8000034: 20000000 andcs r0, r0, r0
8000038: 20000004 andcs r0, r0, r4
0800003c <fun2>:
800003c: e52db004 push {r11} ; (str r11, [sp, #-4]!)
8000040: e28db000 add r11, sp, #0
8000044: e59f3020 ldr r3, [pc, #32] ; 800006c <fun2+0x30>
8000048: e3a02005 mov r2, #5
800004c: e5832000 str r2, [r3]
8000050: e59f3018 ldr r3, [pc, #24] ; 8000070 <fun2+0x34>
8000054: e3a02006 mov r2, #6
8000058: e5832000 str r2, [r3]
800005c: e1a00000 nop ; (mov r0, r0)
8000060: e28bd000 add sp, r11, #0
8000064: e49db004 pop {r11} ; (ldr r11, [sp], #4)
8000068: e12fff1e bx lr
800006c: 20000008 andcs r0, r0, r8
8000070: 2000000c andcs r0, r0, r12
08000074 <fun3>:
8000074: e52db004 push {r11} ; (str r11, [sp, #-4]!)
8000078: e28db000 add r11, sp, #0
800007c: e59f3020 ldr r3, [pc, #32] ; 80000a4 <fun3+0x30>
8000080: e3a02005 mov r2, #5
8000084: e5832000 str r2, [r3]
8000088: e59f3018 ldr r3, [pc, #24] ; 80000a8 <fun3+0x34>
800008c: e3a02006 mov r2, #6
8000090: e5832000 str r2, [r3]
8000094: e1a00000 nop ; (mov r0, r0)
8000098: e28bd000 add sp, r11, #0
800009c: e49db004 pop {r11} ; (ldr r11, [sp], #4)
80000a0: e12fff1e bx lr
80000a4: 20000010 andcs r0, r0, r0, lsl r0
80000a8: 20000014 andcs r0, r0, r4, lsl r0
Disassembly of section .data:
20000000 <hello>:
20000000: 00000004 andeq r0, r0, r4
20000004 <one>:
20000004: 00000005 andeq r0, r0, r5
20000008 <hello>:
20000008: 00000004 andeq r0, r0, r4
2000000c <two>:
2000000c: 00000005 andeq r0, r0, r5
20000010 <hello>:
20000010: 00000004 andeq r0, r0, r4
There are three hello variables created. (You should notice by now that there is no reason to start up the debugger; this can all be answered by simply examining the compiler and linker output. The debugger just gets in the way.)
800000c: e59f3020 ldr r3, [pc, #32] ; 8000034 <fun1+0x30>
8000034: 20000000 andcs r0, r0, r0
8000044: e59f3020 ldr r3, [pc, #32] ; 800006c <fun2+0x30>
800006c: 20000008 andcs r0, r0, r8
800007c: e59f3020 ldr r3, [pc, #32] ; 80000a4 <fun3+0x30>
80000a4: 20000010 andcs r0, r0, r0, lsl r0
20000000 <hello>:
20000000: 00000004 andeq r0, r0, r4
20000008 <hello>:
20000008: 00000004 andeq r0, r0, r4
20000010 <hello>:
20000010: 00000004 andeq r0, r0, r4
Each function is accessing its own separate version of the static global. They are not combined into one shared global.
The answers thus far have demonstrated that it should work as written, but the actual answer is only in the comments so I will post it as an answer.
What you're seeing is a debugger artifact, not the real situation. In my experience, this should be your first guess for any truly weird observation within the debugger. Verify the observation in the actual running program before going on, e.g. with an old-fashioned debug printf statement.
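A debug printf check along those lines might look like the sketch below. It is a single-file analogue only: file-scope statics in separate .c files cannot be shown in one block, so function-scope statics with the same name stand in for them here (they also have no linkage). The function names and the out-parameter are hypothetical.

```c
#include <stdio.h>

/* Stand-in for foo.c: this myVar is private to this function. */
int some_foo_func(const int **where)
{
    static int myVar = 1;
    myVar += 2;
    if (where != NULL)
        *where = &myVar;  /* report which object was touched */
    return myVar;
}

/* Stand-in for bar.c: same name, but a different object. */
int some_bar_func(const int **where)
{
    static int myVar = 1;
    myVar += 3;
    if (where != NULL)
        *where = &myVar;
    return myVar;
}

/* Print what the running program sees, independent of any debugger.
 * Returns 1 when the two variables live at different addresses. */
int check_statics(void)
{
    const int *foo_addr = NULL;
    const int *bar_addr = NULL;
    int foo_val = some_foo_func(&foo_addr);
    int bar_val = some_bar_func(&bar_addr);
    printf("foo: &myVar=%p value=%d\n", (const void *)foo_addr, foo_val);
    printf("bar: &myVar=%p value=%d\n", (const void *)bar_addr, bar_val);
    return foo_addr != bar_addr;
}
```

In the real project the equivalent would be a temporary printf of &myVar inside each of the three source files; if the printed addresses differ, the shared address shown by the debugger is a symbol lookup artifact.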

How to make bare metal ARM programs and run them on QEMU?

I am trying to get this tutorial to work as intended, without success (something fails after the bl main instruction).
According to the tutorial, the command
(qemu) xp /1dw 0xa0000018
should print 33 (but I get 0x00 instead):
a0000018: 33
This is the content of the registers after the main call (see startup.s)
(qemu) info registers
R00=a000001c R01=a000001c R02=00000006 R03=00000000
R04=00000000 R05=00000005 R06=00000006 R07=00000007
R08=00000008 R09=00000009 R10=00000000 R11=a3fffffc
R12=00000000 R13=00000000 R14=0000003c R15=00000004
PSR=800001db N--- A und32
FPSCR: 00000000
I have the following files
main.c
startup.s
lscript.ld
Makefile
And I am using the following toolchain
arm-2013.11-24-arm-none-eabi-i686-pc-linux-gnu
Makefile:
SRCS := main.c startup.s
LINKER_NAME := lscript.ld
ELF_NAME := program.elf
BIN_NAME := program.bin
FLASH_NAME := flash.bin
CC := arm-none-eabi
CFLAGS := -nostdlib
OBJFLAGS ?= -DS
QEMUFLAGS := -M connex -pflash $(FLASH_NAME) -nographic -serial /dev/null
# Allocate 16MB to use as a virtual flash for QEMU
# bs = block size -> 4KB
# count = number of blocks -> 4096
# totalsize = 16MB
setup:
dd if=/dev/zero of=$(FLASH_NAME) bs=4096 count=4096
# Compile srcs and write to virtual flash
all: clean setup
$(CC)-gcc $(CFLAGS) -o $(ELF_NAME) -T $(LINKER_NAME) $(SRCS)
$(CC)-objcopy -O binary $(ELF_NAME) $(BIN_NAME)
dd if=$(BIN_NAME) of=$(FLASH_NAME) bs=4096 conv=notrunc
objdump:
$(CC)-objdump $(OBJFLAGS) $(ELF_NAME)
mem-placement:
$(CC)-nm -n $(ELF_NAME)
qemu:
qemu-system-arm $(QEMUFLAGS)
clean:
rm -rf *.bin
rm -rf *.elf
main.c:
static int arr[] = { 1, 10, 4, 5, 6, 7 };
static int sum;
static const int n = sizeof(arr) / sizeof(arr[0]);
int main()
{
int i;
for (i = 0; i < n; i++){
sum += arr[i];
}
return 0;
}
startup.s:
.section "vectors"
reset: b _start
undef: b undef
swi: b swi
pabt: b pabt
dabt: b dabt
nop
irq: b irq
fiq: b fiq
.text
_start:
init:
## Copy data to RAM.
ldr r0, =flash_sdata
ldr r1, =ram_sdata
ldr r2, =data_size
## Handle data_size == 0
cmp r2, #0
beq init_bss
copy:
ldrb r4, [r0], #1
strb r4, [r1], #1
subs r2, r2, #1
bne copy
init_bss:
## Initialize .bss
ldr r0, =sbss
ldr r1, =ebss
ldr r2, =bss_size
## Handle bss_size == 0
cmp r2, #0
beq init_stack
mov r4, #0
zero:
strb r4, [r0], #1
subs r2, r2, #1
bne zero
init_stack:
## Initialize the stack pointer
ldr sp, =0xA4000000
## **This call doesn't work as expected (r13/sp contains 0xA4000000).**
bl main
## Doesn't return from main
## r0 should now contain 33
stop:
b stop
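For readers more at home in C, the copy and zero loops in startup.s above do roughly the following. This is a host-side sketch only: the arrays standing in for flash and RAM, and the run_startup_and_main name, are assumptions made so the block runs on an ordinary machine instead of using the linker symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise copy of the .data image, like the `copy` loop. */
void copy_data(const uint8_t *flash_sdata, uint8_t *ram_sdata, size_t data_size)
{
    while (data_size != 0) {       /* handles data_size == 0 */
        *ram_sdata++ = *flash_sdata++;
        data_size--;
    }
}

/* Byte-wise clear of .bss, like the `zero` loop. */
void zero_bss(uint8_t *sbss, size_t bss_size)
{
    while (bss_size != 0) {        /* handles bss_size == 0 */
        *sbss++ = 0;
        bss_size--;
    }
}

/* Hypothetical stand-ins for the flash image and the RAM sections. */
static const int flash_image[6] = { 1, 10, 4, 5, 6, 7 };
static int ram_arr[6];             /* plays the role of .data (arr) */
static int ram_sum = 0x5a5a5a5a;   /* plays the role of .bss (sum), garbage at reset */

/* Run the startup steps, then the loop from main.c. */
int run_startup_and_main(void)
{
    copy_data((const uint8_t *)flash_image, (uint8_t *)ram_arr, sizeof ram_arr);
    zero_bss((uint8_t *)&ram_sum, sizeof ram_sum);
    for (size_t i = 0; i < sizeof ram_arr / sizeof ram_arr[0]; i++)
        ram_sum += ram_arr[i];
    return ram_sum;
}
```

With the initial values from main.c the sum is 1 + 10 + 4 + 5 + 6 + 7 = 33, which is the value expected at 0xa0000018.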
lscript.ld:
/*
* Linker for testing purposes
* (using 16 MB virtual flash = 0x0100_0000)
*/
MEMORY {
rom (rx) : ORIGIN = 0x00000000, LENGTH = 0x01000000
ram (rwx) : ORIGIN = 0xA0000000, LENGTH = 0x04000000
}
SECTIONS {
.text : {
* (vectors);
* (.text);
} > rom
.rodata : {
* (.rodata);
} > rom
flash_sdata = .;
ram_sdata = ORIGIN(ram);
.data : AT (flash_sdata) {
* (.data);
} > ram
ram_edata = .;
data_size = ram_edata - ram_sdata;
sbss = .;
.bss : {
* (.bss);
} > ram
ebss = .;
bss_size = ebss - sbss;
/DISCARD/ : {
*(.note*)
*(.comment)
*(.ARM*)
/*
*(.debug*)
*/
}
}
Disassembly of the executable (objdump):
program.elf: file format elf32-littlearm
Disassembly of section .text:
00000000 <reset>:
0: ea000023 b 94 <_start>
00000004 <undef>:
4: eafffffe b 4 <undef>
00000008 <swi>:
8: eafffffe b 8 <swi>
0000000c <pabt>:
c: eafffffe b c <pabt>
00000010 <dabt>:
10: eafffffe b 10 <dabt>
14: e320f000 nop {0}
00000018 <irq>:
18: eafffffe b 18 <irq>
0000001c <fiq>:
1c: eafffffe b 1c <fiq>
00000020 <main>:
20: e52db004 push {fp} ; (str fp, [sp, #-4]!)
24: e28db000 add fp, sp, #0
28: e24dd00c sub sp, sp, #12
2c: e3a03000 mov r3, #0
30: e50b3008 str r3, [fp, #-8]
34: ea00000d b 70 <main+0x50>
38: e3003000 movw r3, #0
3c: e34a3000 movt r3, #40960 ; 0xa000
40: e51b2008 ldr r2, [fp, #-8]
44: e7932102 ldr r2, [r3, r2, lsl #2]
48: e3003018 movw r3, #24
4c: e34a3000 movt r3, #40960 ; 0xa000
50: e5933000 ldr r3, [r3]
54: e0822003 add r2, r2, r3
58: e3003018 movw r3, #24
5c: e34a3000 movt r3, #40960 ; 0xa000
60: e5832000 str r2, [r3]
64: e51b3008 ldr r3, [fp, #-8]
68: e2833001 add r3, r3, #1
6c: e50b3008 str r3, [fp, #-8]
70: e3a02006 mov r2, #6
74: e51b3008 ldr r3, [fp, #-8]
78: e1530002 cmp r3, r2
7c: baffffed blt 38 <main+0x18>
80: e3a03000 mov r3, #0
84: e1a00003 mov r0, r3
88: e24bd000 sub sp, fp, #0
8c: e49db004 pop {fp} ; (ldr fp, [sp], #4)
90: e12fff1e bx lr
00000094 <_start>:
94: e59f004c ldr r0, [pc, #76] ; e8 <stop+0x4>
98: e59f104c ldr r1, [pc, #76] ; ec <stop+0x8>
9c: e59f204c ldr r2, [pc, #76] ; f0 <stop+0xc>
a0: e3520000 cmp r2, #0
a4: 0a000003 beq b8 <init_bss>
000000a8 <copy>:
a8: e4d04001 ldrb r4, [r0], #1
ac: e4c14001 strb r4, [r1], #1
b0: e2522001 subs r2, r2, #1
b4: 1afffffb bne a8 <copy>
000000b8 <init_bss>:
b8: e59f0034 ldr r0, [pc, #52] ; f4 <stop+0x10>
bc: e59f1034 ldr r1, [pc, #52] ; f8 <stop+0x14>
c0: e59f2034 ldr r2, [pc, #52] ; fc <stop+0x18>
c4: e3520000 cmp r2, #0
c8: 0a000003 beq dc <init_stack>
cc: e3a04000 mov r4, #0
000000d0 <zero>:
d0: e4c04001 strb r4, [r0], #1
d4: e2522001 subs r2, r2, #1
d8: 1afffffc bne d0 <zero>
000000dc <init_stack>:
dc: e3a0d329 mov sp, #-1543503872 ; 0xa4000000
e0: ebffffce bl 20 <main>
000000e4 <stop>:
e4: eafffffe b e4 <stop>
e8: 00000104 andeq r0, r0, r4, lsl #2
ec: a0000000 andge r0, r0, r0
f0: 00000018 andeq r0, r0, r8, lsl r0
f4: a0000018 andge r0, r0, r8, lsl r0
f8: a000001c andge r0, r0, ip, lsl r0
fc: 00000004 andeq r0, r0, r4
Disassembly of section .rodata:
00000100 <n>:
100: 00000006 andeq r0, r0, r6
Disassembly of section .data:
a0000000 <arr>:
a0000000: 00000001 andeq r0, r0, r1
a0000004: 0000000a andeq r0, r0, sl
a0000008: 00000004 andeq r0, r0, r4
a000000c: 00000005 andeq r0, r0, r5
a0000010: 00000006 andeq r0, r0, r6
a0000014: 00000007 andeq r0, r0, r7
Disassembly of section .bss:
a0000018 <sum>:
a0000018: 00000000 andeq r0, r0, r0
Can someone point me in the right direction to why this isn't working according to my expectations?
Thanks Henrik
Minimal examples that just work
https://github.com/cirosantilli/linux-kernel-module-cheat/tree/54e15e04338c0fecc0be139a0da2d0d972c21419#baremetal-setup-getting-started
The prompt.c example takes input from your host terminal and gives back output all through the simulated UART:
enter a character
got: a
new alloc of 1 bytes at address 0x0x4000a1c0
enter a character
got: b
new alloc of 2 bytes at address 0x0x4000a1c0
enter a character
It uses Newlib to expose a subset of the C standard library. This allows you to run existing programs written in C if they only use that restricted subset of the C standard library.
More details about Newlib at: https://electronics.stackexchange.com/questions/223929/c-standard-libraries-on-bare-metal/400077#400077
https://github.com/freedomtan/aarch64-bare-metal-qemu/tree/2ae937a2b106b43bfca49eec49359b3e30eac1b1 for -M virt, just the hello world on the repo. Compile with:
sudo apt-get install gcc-aarch64-linux-gnu
make CROSS_PREFIX=aarch64-linux-gnu-
Here is the example minimized to printing a single character from assembly: How to run a bare metal ELF file on QEMU?
https://github.com/bztsrc/raspi3-tutorial for -M raspi3. Quick getting started at: https://raspberrypi.stackexchange.com/questions/34733/how-to-do-qemu-emulation-for-bare-metal-raspberry-pi-images/85135#85135 Several other examples on the repo going to more advanced subjects.
Also does display output on 09_framebuffer.
Both write a hello world to the UART.
Tested in Ubuntu 18.04, gcc-aarch64-linux-gnu version 4:7.3.0-3ubuntu2.
Debugging!
First, look at the PC and PSR: You're in Undef mode, in the undefined instruction handler.
OK, in an exception mode, the LR tells you where you took the exception. There are some slightly complicated rules between the PC offset and the preferred return address determining exactly what it points at, but just eyeballing it it's clearly in the vicinity of the movw/movt pair.
The movw instruction effectively only exists in the ARMv7 ISA onwards. A brief investigation tells me the machine you're emulating is some old PXA255 thing, whose CPU only implements the ARMv5 ISA. Thus it's not surprising it faults on an instruction that it predates by many years.
Your compiler is apparently configured to target ARMv7 by default (which is not uncommon), so you need to add at least -march=armv5te to your CFLAGS to target the appropriate architecture version. The 'advanced' challenge would be to switch to a different, newer, machine, but that's going to involve adapting the linker script to a new memory map and rewriting any hardware-touching code for new peripherals, so I'd save that idea for the longer term, once you're comfortable with the basics of bare-metal code and slogging through hardware reference manuals.
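The reasoning above can be replayed on the numbers from the question. This is a rough sketch with made-up helper names; it assumes the R14 shown by info registers is the Undef-mode LR, that the CPU was in ARM state (so LR is the faulting address plus 4), and the A1 encodings of MOVW (cond 0011 0000 ...) and MOVT (cond 0011 0100 ...).

```c
#include <stdint.h>

/* PSR bits [4:0] are the mode field; 0x1b is Undefined mode. */
int psr_is_undef_mode(uint32_t psr)
{
    return (psr & 0x1fu) == 0x1bu;
}

/* In ARM state, an Undefined Instruction exception leaves
 * LR = address of the faulting instruction + 4. */
uint32_t undef_fault_address(uint32_t lr_und)
{
    return lr_und - 4u;
}

/* MOVW (A1): bits [27:20] == 0011 0000. Condition field is ignored here. */
int is_movw(uint32_t insn)
{
    return ((insn >> 20) & 0xffu) == 0x30u;
}

/* MOVT (A1): bits [27:20] == 0011 0100. Condition field is ignored here. */
int is_movt(uint32_t insn)
{
    return ((insn >> 20) & 0xffu) == 0x34u;
}

/* The 16-bit immediate is split as imm4 (bits [19:16]) : imm12 (bits [11:0]). */
uint32_t mov_imm16(uint32_t insn)
{
    return (((insn >> 16) & 0xfu) << 12) | (insn & 0xfffu);
}
```

Fed with PSR=800001db and R14=0000003c from the register dump, this points at address 0x38, where the disassembly shows e3003000 (movw r3, #0), followed by e34a3000 (movt r3, #0xa000).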
For the same code on my Ubuntu machine I got:
arm-none-eabi-gcc -nostdlib -o sum.elf sum.lds startup.s -w
/usr/lib/gcc/arm-none-eabi/4.9.3/../../../arm-none-eabi/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000
/tmp/ccBthV7t.o: In function `init_stack':
(.text+0x4c): undefined reference to `main'
collect2: error: ld returned 1 exit status

Would Thumb-2 ARM-Core Micros From Different Manufacturers Have Same Codesize?

Comparing two Thumb-2 micros from two different manufacturers. One's a Cortex M3, one's an A5. Are they guaranteed to compile a particular piece of code to the same codesize?
so here goes
fun.c
unsigned int fun ( unsigned int x )
{
return(x);
}
addimm.c
extern unsigned int fun ( unsigned int );
unsigned int addimm ( unsigned int x )
{
return(fun(x)+0x123);
}
for demonstration purposes building for bare metal, not really a functional program but it compiles clean and demonstrates what I intend to demonstrate.
arm instructions
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=cortex-a5 -march=armv7-a -c addimm.c -o addimma.o
disassembly of the object, not linked
00000000 <addimm>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <fun>
8: e2800e12 add r0, r0, #288 ; 0x120
c: e2800003 add r0, r0, #3
10: e8bd8008 pop {r3, pc}
thumb generic (armv4 or v5 whatever the default was for this compiler build)
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -c addimm.c -o addimmt.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: 3024 adds r0, #36 ; 0x24
8: 30ff adds r0, #255 ; 0xff
a: bc08 pop {r3}
c: bc02 pop {r1}
e: 4708 bx r1
cortex-a5 specific
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-a5 -march=armv7-a -c addimm.c -o addimma5.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: f200 1023 addw r0, r0, #291 ; 0x123
a: bd08 pop {r3, pc}
The cortex-a5 is armv7-a, which supports thumb-2. As far as the add immediate itself goes, there is no size gain here: 32 bits for thumb (two 16-bit adds) and 32 bits for thumb2 (one addw). But this is only one example; there will perhaps be times when thumb2 produces smaller binaries than thumb.
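Why 0x123 needs two adds in the arm and generic thumb builds but only one addw in thumb2 comes down to the immediate encodings. Here is a small sketch of those rules; the function names are made up, and the rules assumed are: an arm data-processing immediate is an 8-bit value rotated right by an even amount, a 16-bit thumb adds takes an 8-bit immediate, and thumb2 addw takes a 12-bit immediate.

```c
#include <stdint.h>

/* Rotate left by r bits (r in 0..31) without shifting by 32. */
static uint32_t rotl32(uint32_t v, unsigned r)
{
    return (v << r) | (v >> ((32u - r) & 31u));
}

/* ARM data-processing immediate: imm8 rotated right by an even amount.
 * Equivalently, some even left rotation of v must fit in 8 bits. */
int arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(v, rot) <= 0xffu)
            return 1;
    }
    return 0;
}

/* 16-bit Thumb "adds Rdn, #imm8": 0..255. */
int thumb16_add_imm_ok(uint32_t v)
{
    return v <= 0xffu;
}

/* Thumb-2 "addw Rd, Rn, #imm12": 0..4095. */
int thumb2_addw_ok(uint32_t v)
{
    return v <= 0xfffu;
}
```

So 0x123 is split into 0x120 + 3 for arm and into 0x24 + 0xff for 16-bit thumb, while addw takes it in one instruction. The same check accepts 0xa4000000, which matches the single mov used to load the stack pointer in other arm listings.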
cortex-m3
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m3 -march=armv7-m -c addimm.c -o addimmm3.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: f200 1023 addw r0, r0, #291 ; 0x123
a: bd08 pop {r3, pc}
This produced the same result as the cortex-a5 build. For this simple example the machine code for the object is identical, and so the same size, whether built for cortex-a5 or cortex-m3.
Now if I add a bootstrap and a main that calls this function, and fill in the function it calls, to create a complete, linked program:
00000000 <_start>:
0: f000 f802 bl 8 <notmain>
4: e7fe b.n 4 <_start+0x4>
...
00000008 <notmain>:
8: 2005 movs r0, #5
a: f000 b801 b.w 10 <addimm>
e: bf00 nop
00000010 <addimm>:
10: b508 push {r3, lr}
12: f000 f803 bl 1c <fun>
16: f200 1023 addw r0, r0, #291 ; 0x123
1a: bd08 pop {r3, pc}
0000001c <fun>:
1c: 4770 bx lr
1e: 46c0 nop ; (mov r8, r8)
We get a result. The addimm function itself did not change in size. With a cortex-a5 you have to have some arm code that then switches to thumb, and when linking with libraries etc. you will likely get a mixture of arm and thumb, so:
00000000 <_start>:
0: eb000000 bl 8 <notmain>
4: eafffffe b 4 <_start+0x4>
00000008 <notmain>:
8: e92d4008 push {r3, lr}
c: e3a00005 mov r0, #5
10: fa000001 blx 1c <addimm>
14: e8bd4008 pop {r3, lr}
18: e12fff1e bx lr
0000001c <addimm>:
1c: b508 push {r3, lr}
1e: f000 e804 blx 28 <fun>
22: f200 1023 addw r0, r0, #291 ; 0x123
26: bd08 pop {r3, pc}
00000028 <fun>:
28: e12fff1e bx lr
An overall larger binary (0x2c bytes instead of 0x20), though the addimm part itself did not change in size.
As for linking changing the size of the object, look at this example:
bootstrap.s
.thumb
.thumb_func
.globl _start
_start:
bl notmain
hang: b hang
.thumb_func
.globl dummy
dummy:
bx lr
.code 32
.globl bounce
bounce:
bx lr
hello.c
void dummy ( void );
void bounce ( void );
void notmain ( void )
{
dummy();
bounce();
}
looking at an arm build of notmain by itself, the object:
00000000 <notmain>:
0: e92d4800 push {fp, lr}
4: e28db004 add fp, sp, #4
8: ebfffffe bl 0 <dummy>
c: ebfffffe bl 0 <bounce>
10: e24bd004 sub sp, fp, #4
14: e8bd4800 pop {fp, lr}
18: e12fff1e bx lr
Depending on what is calling it and what it calls, the linker may have to add more code to deal with items that are defined outside the object, from global variables to external functions:
00008000 <_start>:
8000: f000 f818 bl 8034 <__notmain_from_thumb>
00008004 <hang>:
8004: e7fe b.n 8004 <hang>
00008006 <dummy>:
8006: 4770 bx lr
00008008 <bounce>:
8008: e12fff1e bx lr
0000800c <notmain>:
800c: e92d4800 push {fp, lr}
8010: e28db004 add fp, sp, #4
8014: eb000003 bl 8028 <__dummy_from_arm>
8018: ebfffffa bl 8008 <bounce>
801c: e24bd004 sub sp, fp, #4
8020: e8bd4800 pop {fp, lr}
8024: e12fff1e bx lr
00008028 <__dummy_from_arm>:
8028: e59fc000 ldr ip, [pc] ; 8030 <__dummy_from_arm+0x8>
802c: e12fff1c bx ip
8030: 00008007 andeq r8, r0, r7
00008034 <__notmain_from_thumb>:
8034: 4778 bx pc
8036: 46c0 nop ; (mov r8, r8)
8038: eafffff3 b 800c <notmain>
803c: 00000000 andeq r0, r0, r0
__dummy_from_arm and __notmain_from_thumb were both added, an increase in the size of the binary. Each object did not change in size but the overall binary did. bounce() was an arm to arm call, so no patching; dummy() was arm to thumb and notmain() was thumb to arm.
So you might have a cortex-m3 object and a cortex-a5 object that, as far as the code in the object goes, are identical. But depending on what you link them with, and eventually something is different between a cortex-m3 system and a cortex-a5 system, you may see more or less code added by the linker to account for the system differences: libraries, operating system specifics, etc. Even where in the binary you put the object matters; if a call has to reach further than it can with a single instruction, the linker will add even more code.
This is all gcc specific stuff; each toolchain is going to deal with each of these problems in its own way. It is the nature of the beast when you use an object and linker model. It is a very good model, but the compiler, assembler, and linker have to work together to make sure that global resources can be properly accessed when linked. It has nothing to do with ARM in particular; this problem exists with many/most processor architectures, and the toolchains deal with it per toolchain, per version, per target architecture. When I said change the size of the object, what I really meant was that the linker may add more code to the final binary in order to deal with that object and how it interacts with others.

Resources