Number of instructions used ARMv7 - arm

I am trying to figure out how many CPU cycles will be used to execute the delay function
delay:
subs r0, #1
bmi end_delay
b delay
end_delay:
bx lr
I feel intuitively that 1 CPU cycle should be used for each instruction, so if we began with r0 = 4 it would take 11 CPU cycles to complete the code above. Is that correct?

The cortex-m is not the same as a microchip pic chip (or z80 and some others): you cannot create a predictable delay this way with this instruction set. You can ensure the delay will be AT OR SLOWER than some amount of time (clocks), but not exactly that amount.
0000009c <hello>:
9c: 3801 subs r0, #1
9e: d1fd bne.n 9c <hello>
your loop has a branch decision in there, more instructions and more paths basically so the opportunity for execution time to vary gets worse.
00000090 <delay>:
90: 3801 subs r0, #1
92: d400 bmi.n 96 <end_delay>
94: e7fc b.n 90 <delay>
00000096 <end_delay>:
so if we focus on these three instructions.
some cortex-ms have a build (of the logic) time option of fetching per instruction or per word, the cortex-m4 documentation says:
All fetches are word-wide.
so we hope that halfword alignment won't affect performance. with these instructions we don't necessarily expect to see the difference anyway. with a full sized arm the fetches are multiple words so you will definitely see fetch line (size) effects.
The execution depends heavily on the implementation. The cortex-m is just the arm core, the rest of the chip is from the chip vendor, purchased IP or built in house or a combination (very likely the latter). ARM does not make chips (other than perhaps for validation) they make IP that they sell.
The chip vendor determines the flash (and ram) implementation. Often with these types of chips the flash speed is at or slower than the cpu speed, meaning it can take two clocks to fetch one instruction, which means you never feed the cpu as fast as it can go. Some, like ST, put in a cache that you cannot (so far as I know) turn off, so it is hard to see this effect (but still possible). The documentation for the particular chip I am using says:
8.2.3.1 Prefetch Buffer
The Flash memory controller has a prefetch buffer that is automatically used when the CPU frequency is greater than 40 MHz. In this mode, the Flash memory operates at half of the system clock. The prefetch buffer fetches two 32-bit words per clock allowing instructions to be fetched with no wait states while code is executing linearly. The fetch buffer includes a branch speculation mechanism that recognizes a branch and avoids extra wait states by not reading the next word pair. Also, short loop branches often stay in the buffer. As a result, some branches can be executed with no wait states. Other branches incur a single wait state.
and of course like ST they dont really tell you the whole story. So we just go in and try this. You can use debug timers if you want but the systick runs off the same clock and gives you the same result
00000086 <test>:
86: f3bf 8f4f dsb sy
8a: f3bf 8f6f isb sy
8e: 680a ldr r2, [r1, #0]
00000090 <delay>:
90: 3801 subs r0, #1
92: d400 bmi.n 96 <end_delay>
94: e7fc b.n 90 <delay>
00000096 <end_delay>:
96: 680b ldr r3, [r1, #0]
98: 1ad0 subs r0, r2, r3
9a: 4770 bx lr
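The wrapper above reads the timer before and after the loop and subtracts the second reading from the first, which works because SysTick counts down. A minimal sketch of that subtraction in C, assuming a 24-bit down-counter that reloads from 0x00FFFFFF; the names `systick_elapsed` and `SYSTICK_MASK` are made up for this illustration and are not from the original test code:

```c
#include <assert.h>
#include <stdint.h>

/* SysTick's current-value register is a 24-bit down-counter.
 * SYSTICK_MASK is a name chosen for this sketch. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Ticks elapsed between two reads of a down-counting 24-bit timer.
 * 'start' is read before the code under test, 'end' after it.
 * Masking keeps the result correct across a single wrap, assuming
 * the reload value is 0x00FFFFFF. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    /* both readings must fit in 24 bits */
    assert(start <= SYSTICK_MASK && end <= SYSTICK_MASK);
    return (start - end) & SYSTICK_MASK;
}
```

The mask matters only when the counter wraps between the two reads; for a short loop like this one the plain subtraction in the listing gives the same answer almost every time.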
So I read the CCR and CPUID
00000200 CCR
410FC241 CPUID
just because. then ran the code under test three times
00000015
00000015
00000015
these numbers are in hex so that is 21 clocks (the timer counts clocks, not instructions). same execution time each time, so no cache or branch prediction cache effects. I didn't see anything related to branch prediction on the cortex-m4; other cortex-ms do have branch prediction (maybe only the m7). I have the I and D cache off; they will of course, along with alignment, greatly affect the execution time (and that time can/will vary as your application runs).
I changed the alignment (add or remove nops in front of this code)
0000008a <delay>:
8a: 3801 subs r0, #1
8c: d400 bmi.n 90 <end_delay>
8e: e7fc b.n 8a <delay>
and it didnt affect the execution time.
AFAIK with this processor we cannot change the flash wait state settings directly, it is automatic based on clock settings. So running at a different clock speed, above the 40 MHz mark, I get
0000001E
0000001E
0000001E
For the same machine code, same alignment 30 clocks now instead of 21.
Normally the ram is faster and no wait state (understand these busses take several clocks per transaction, so it is not like the old days, but there is still a delay you can detect), so running these instructions in ram should tell us something
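for(rb=0;rb<0x20;rb+=2)
{
    hexstrings(rb);                 /* offset under test (first column) */
    ra=0x20001000+rb;
    PUT16(ra,0x680a); ra+=2;        /* ldr r2,[r1]   timer read before the loop */
    hexstrings(ra);                 /* address of the loop (second column) */
    PUT16(ra,0x3801); ra+=2;        /* subs r0,#1 */
    PUT16(ra,0xd400); ra+=2;        /* bmi end_delay */
    PUT16(ra,0xe7fc); ra+=2;        /* b delay */
    PUT16(ra,0x680b); ra+=2;        /* ldr r3,[r1]   timer read after the loop */
    PUT16(ra,0x1ad0); ra+=2;        /* subs r0,r2,r3 */
    PUT16(ra,0x4770); ra+=2;        /* bx lr */
    PUT16(ra,0x46c0); ra+=2;        /* nop (mov r8,r8) padding */
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    /* presumably: run the copied code with r0=4 and r1=STCURRENT */
    hexstring(BRANCHTO(4,STCURRENT,0x20001001+rb)&STMASK);
}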
and that certainly gets interesting...
00000000 20001002 00000026
00000002 20001004 00000020
00000004 20001006 00000026
00000006 20001008 00000020
00000008 2000100A 00000026
0000000A 2000100C 00000020
0000000C 2000100E 00000026
0000000E 20001010 00000020
00000010 20001012 00000026
00000012 20001014 00000020
00000014 20001016 00000026
00000016 20001018 00000020
00000018 2000101A 00000026
0000001A 2000101C 00000020
0000001C 2000101E 00000026
0000001E 20001020 00000020
first off it is 32 or 38 clocks (0x20 or 0x26), second there is an alignment effect: the result alternates with every halfword step in the start address.
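The alignment effect in the table above can be checked mechanically by grouping the measurements by where the loop starts within a 32-bit word. A small sketch, with the address/clock pairs copied from the first four rows of the output above; the struct and function names are invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sample {
    uint32_t loop_addr; /* address of the subs instruction (second column) */
    uint32_t clocks;    /* measured count (third column) */
};

/* Returns 1 if every sample whose loop starts at the given offset within
 * a 32-bit word (0 or 2 for Thumb code) reports the same clock count, and
 * stores that count in *clocks. Returns 0 if the counts disagree or no
 * sample has that offset. */
int clocks_for_word_offset(const struct sample *s, size_t n,
                           uint32_t offset, uint32_t *clocks)
{
    int found = 0;
    assert(s != NULL && clocks != NULL);
    for (size_t i = 0; i < n; i++) {
        if ((s[i].loop_addr & 3u) != offset)
            continue;
        if (found && s[i].clocks != *clocks)
            return 0;
        *clocks = s[i].clocks;
        found = 1;
    }
    return found;
}

/* First four rows of the SRAM results above. */
const struct sample sram_samples[] = {
    { 0x20001002u, 0x26u },
    { 0x20001004u, 0x20u },
    { 0x20001006u, 0x26u },
    { 0x20001008u, 0x20u },
};
```

With these rows, a word-aligned loop start (offset 0) gives 0x20 and a halfword-offset start (offset 2) gives 0x26.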
The armv7-m CCR shows a branch prediction bit, but the trm and the vendor documentation dont show it, so it could be a generic thing that not all cores support.
So for a specific cortex-m4 chip the time to execute your loop is between 21 and 38 clocks, and I could probably make it slower if I wanted to. I dont think I could get it down to 11 on this chip though.
If you are for example doing i2c bit banging you can use something like this for a delay; it won't be optimal but will work just fine. If you need something more precise, within a window of time (at least this long but not greater than that), then use a timer (and understand that, polled or interrupt, your accuracy will have some error). If the timer peripheral or another peripheral can generate the signal you want, you can then get down to a clock accurate waveform (if that is what your delay is for).
another cortex-m4 is expected to have different results. I would expect an stm32 to have the sram be the same as or faster than flash, not slower as in this case. And there are settings you can mess with, or that your init code sets if you are relying on someone else to set up your chip, that can/will affect execution time.
EDIT
I don't know where I got the idea this was for a cortex-m4, which is an armv7-m. I didn't have a raspberry pi 2 handy, but had a pi3, so I ran that in aarch32 mode with 32 bit instructions. I had no idea how much work this would be to get the timers running and then the cache enabled. The pi runs out of dram, which is very inconsistent even with bare metal. So I figured I would enable the l1 cache, and after the first run it should be all in cache and consistent. Now that I think about it there are four cores and each is running. I don't know how to disable them; the other three are spinning in a loop waiting for a mailbox register to tell them what code to run. perhaps I need to have them branch somewhere and run out of l1 cache as well...not sure if the l1 is per core or shared, I think I looked that up at one point.
Anyway code under test
000080c8 <COUNTER>:
80c8: ee192f1d mrc 15, 0, r2, cr9, cr13, {0}
000080cc <delay>:
80cc: e2500001 subs r0, r0, #1
80d0: 4a000000 bmi 80d8 <end_delay>
80d4: eafffffc b 80cc <delay>
000080d8 <end_delay>:
80d8: ee193f1d mrc 15, 0, r3, cr9, cr13, {0}
80dc: e0430002 sub r0, r3, r2
80e0: e12fff1e bx lr
and the punch line for that alignment: the first column is the r0 passed, the next three are three runs, and the last column, if present, is the delta from the prior row to the current one (the cost of an extra count value in r0)
00000000 0000000A 0000000A 0000000A
00000001 00000014 00000014 00000014 0000000A
00000002 0000001E 0000001E 0000001E 0000000A
00000003 00000028 00000028 00000028 0000000A
00000004 00000032 00000032 00000032 0000000A
00000005 0000003C 0000003C 0000003C 0000000A
00000006 00000046 00000046 00000046 0000000A
00000007 00000050 00000050 00000050 0000000A
00000008 0000005A 0000005A 0000005A 0000000A
00000009 00000064 00000064 00000064 0000000A
0000000A 0000006E 0000006E 0000006E 0000000A
0000000B 00000078 00000078 00000078 0000000A
0000000C 00000082 00000082 00000082 0000000A
0000000D 0000008C 0000008C 0000008C 0000000A
0000000E 00000096 00000096 00000096 0000000A
0000000F 000000A0 000000A0 000000A0 0000000A
00000010 000000AA 000000AA 000000AA 0000000A
00000011 000000B4 000000B4 000000B4 0000000A
00000012 000000BE 000000BE 000000BE 0000000A
00000013 000000C8 000000C8 000000C8 0000000A
then to make alignment checking easier which I didnt need to do in the end
had it try different alignments for the above code (address in first column) and the results for a r0 of four.
00010000 00000032
00010004 0000002D
00010008 00000032
0001000C 0000002D
this repeats up to address 0x101FC
If I change the alignment in the compiled test
000080cc <COUNTER>:
80cc: ee192f1d mrc 15, 0, r2, cr9, cr13, {0}
000080d0 <delay>:
80d0: e2500001 subs r0, r0, #1
80d4: 4a000000 bmi 80dc <end_delay>
80d8: eafffffc b 80d0 <delay>
000080dc <end_delay>:
80dc: ee193f1d mrc 15, 0, r3, cr9, cr13, {0}
80e0: e0430002 sub r0, r3, r2
80e4: e12fff1e bx lr
then it is a wee bit faster.
00000000 00000009 00000009 00000009
00000001 00000012 00000012 00000012 00000009
00000002 0000001B 0000001B 0000001B 00000009
00000003 00000024 00000024 00000024 00000009
00000004 0000002D 0000002D 0000002D 00000009
00000005 00000036 00000036 00000036 00000009
00000006 0000003F 0000003F 0000003F 00000009
00000007 00000048 00000048 00000048 00000009
00000008 00000051 00000051 00000051 00000009
00000009 0000005A 0000005A 0000005A 00000009
0000000A 00000063 00000063 00000063 00000009
0000000B 0000006C 0000006C 0000006C 00000009
0000000C 00000075 00000075 00000075 00000009
0000000D 0000007E 0000007E 0000007E 00000009
0000000E 00000087 00000087 00000087 00000009
0000000F 00000090 00000090 00000090 00000009
00000010 00000099 00000099 00000099 00000009
00000011 000000A2 000000A2 000000A2 00000009
00000012 000000AB 000000AB 000000AB 00000009
00000013 000000B4 000000B4 000000B4 00000009
if I change it to be a function call
000080cc <COUNTER>:
80cc: e92d4001 push {r0, lr}
80d0: ee192f1d mrc 15, 0, r2, cr9, cr13, {0}
80d4: eb000003 bl 80e8 <delay>
80d8: ee193f1d mrc 15, 0, r3, cr9, cr13, {0}
80dc: e8bd4001 pop {r0, lr}
80e0: e0430002 sub r0, r3, r2
80e4: e12fff1e bx lr
000080e8 <delay>:
80e8: e2500001 subs r0, r0, #1
80ec: 4a000000 bmi 80f4 <end_delay>
80f0: eafffffc b 80e8 <delay>
000080f4 <end_delay>:
80f4: e12fff1e bx lr
00000000 0000001A 0000001A 0000001A
00000001 00000023 00000023 00000023 00000009
00000002 0000002C 0000002C 0000002C 00000009
00000003 00000035 00000035 00000035 00000009
00000004 0000003E 0000003E 0000003E 00000009
00000005 00000047 00000047 00000047 00000009
00000006 00000050 00000050 00000050 00000009
00000007 00000059 00000059 00000059 00000009
00000008 00000062 00000062 00000062 00000009
00000009 0000006B 0000006B 0000006B 00000009
0000000A 00000074 00000074 00000074 00000009
0000000B 0000007D 0000007D 0000007D 00000009
0000000C 00000086 00000086 00000086 00000009
0000000D 0000008F 0000008F 0000008F 00000009
0000000E 00000098 00000098 00000098 00000009
0000000F 000000A1 000000A1 000000A1 00000009
00000010 000000AA 000000AA 000000AA 00000009
00000011 000000B3 000000B3 000000B3 00000009
00000012 000000BC 000000BC 000000BC 00000009
00000013 000000C5 000000C5 000000C5 00000009
the cost per count is the same but the call overhead is more expensive
this allows me to use thumb mode just for fun. to avoid the mode change the linker added, I switched modes by hand with bx, which made it a little faster (and consistent).
000080cc <COUNTER>:
80cc: e92d4001 push {r0, lr}
80d0: e59f103c ldr r1, [pc, #60] ; 8114 <edel+0x2>
80d4: e59fe03c ldr lr, [pc, #60] ; 8118 <edel+0x6>
80d8: ee192f1d mrc 15, 0, r2, cr9, cr13, {0}
80dc: e12fff11 bx r1
000080e0 <here>:
80e0: ee193f1d mrc 15, 0, r3, cr9, cr13, {0}
80e4: e8bd4001 pop {r0, lr}
80e8: e0430002 sub r0, r3, r2
80ec: e12fff1e bx lr
000080f0 <delay>:
80f0: e2500001 subs r0, r0, #1
80f4: 4a000000 bmi 80fc <end_delay>
80f8: eafffffc b 80f0 <delay>
000080fc <end_delay>:
80fc: e12fff1e bx lr
8100: e1a00000 nop ; (mov r0, r0)
8104: e1a00000 nop ; (mov r0, r0)
8108: e1a00000 nop ; (mov r0, r0)
0000810c <del>:
810c: 3801 subs r0, #1
810e: d400 bmi.n 8112 <edel>
8110: e7fc b.n 810c <del>
00008112 <edel>:
8112: 4770 bx lr
00000000 000000F4 0000001B 0000001B
00000001 00000024 00000024 00000024 00000009
00000002 0000002D 0000002D 0000002D 00000009
00000003 00000036 00000036 00000036 00000009
00000004 0000003F 0000003F 0000003F 00000009
00000005 00000048 00000048 00000048 00000009
00000006 00000051 00000051 00000051 00000009
00000007 0000005A 0000005A 0000005A 00000009
00000008 00000063 00000063 00000063 00000009
00000009 0000006C 0000006C 0000006C 00000009
0000000A 00000075 00000075 00000075 00000009
0000000B 0000007E 0000007E 0000007E 00000009
0000000C 00000087 00000087 00000087 00000009
0000000D 00000090 00000090 00000090 00000009
0000000E 00000099 00000099 00000099 00000009
0000000F 000000A2 000000A2 000000A2 00000009
00000010 000000AB 000000AB 000000AB 00000009
00000011 000000B4 000000B4 000000B4 00000009
00000012 000000BD 000000BD 000000BD 00000009
00000013 000000C6 000000C6 000000C6 00000009
with this alignment
0000810e <del>:
810e: 3801 subs r0, #1
8110: d400 bmi.n 8114 <edel>
8112: e7fc b.n 810e <del>
00008114 <edel>:
8114: 4770 bx lr
00000000 0000007E 0000001C 0000001C
00000001 00000026 00000026 00000026 0000000A
00000002 00000030 00000030 00000030 0000000A
00000003 0000003A 0000003A 0000003A 0000000A
00000004 00000044 00000044 00000044 0000000A
00000005 0000004E 0000004E 0000004E 0000000A
00000006 00000058 00000058 00000058 0000000A
00000007 00000062 00000062 00000062 0000000A
00000008 0000006C 0000006C 0000006C 0000000A
00000009 00000076 00000076 00000076 0000000A
0000000A 00000080 00000080 00000080 0000000A
0000000B 0000008A 0000008A 0000008A 0000000A
0000000C 00000094 00000094 00000094 0000000A
0000000D 0000009E 0000009E 0000009E 0000000A
0000000E 000000A8 000000A8 000000A8 0000000A
0000000F 000000B2 000000B2 000000B2 0000000A
00000010 000000BC 000000BC 000000BC 0000000A
00000011 000000C6 000000C6 000000C6 0000000A
00000012 000000D0 000000D0 000000D0 0000000A
00000013 000000DA 000000DA 000000DA 0000000A
so in some ideal world on this processor assuming a cache hit on the delay code
00000004 00000032 00000032 00000032 0000000A
00000004 0000002D 0000002D 0000002D 00000009
00000004 0000003E 0000003E 0000003E 00000009
00000004 0000003F 0000003F 0000003F 00000009
00000004 00000044 00000044 00000044 0000000A
between 0x2D and 0x44 clocks to run that loop with r0 = 4
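Every cached table above fits a straight line: a fixed overhead (the r0 = 0 row) plus a constant cost per count (the last column). A minimal sketch of that fit in C; `model_cycles` is a name made up here, and the overhead and per-count values are simply read off the tables, so they describe this one setup rather than the architecture:

```c
#include <assert.h>
#include <stdint.h>

/* Linear model of the cached Pi 3 measurements: a fixed overhead plus a
 * constant cost per count in r0. Both parameters are observations taken
 * from the tables, not architectural guarantees. */
uint64_t model_cycles(uint32_t overhead, uint32_t per_count, uint32_t r0)
{
    /* the loop exits on a negative result, so only counts that are
     * non-negative as signed values follow this model */
    assert(r0 < 0x80000000u);
    return (uint64_t)overhead + (uint64_t)per_count * r0;
}
```

For r0 = 4 the five setups give 0x0A + 4*0x0A = 0x32, 0x09 + 4*0x09 = 0x2D, 0x1A + 4*0x09 = 0x3E, 0x1B + 4*0x09 = 0x3F and 0x1C + 4*0x0A = 0x44, which are the five rows listed above.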
Realistically, this is what you get on this platform without the cache enabled, and/or what you might see if you get a cache miss.
00000000 0000030B 000002B7 000002ED
00000001 0000035B 00000389 000003E9
00000002 000003FB 00000439 0000041B
00000003 0000058F 000004E7 0000055B
00000004 000005FF 0000069D 000006D1
00000005 00000745 00000733 000006F7
00000006 00000883 00000817 00000801
00000007 00000873 00000853 0000089B
00000008 00000923 00000B05 0000092F
00000009 00000A3F 000009A9 00000B4D
0000000A 00000B79 00000BA9 00000C57
0000000B 00000C21 00000D13 00000B51
0000000C 00000C0B 00000E91 00000DE9
0000000D 00000D97 00000E0D 00000E81
0000000E 00000E5B 0000100B 00000F25
0000000F 00001097 00001095 00000F37
00000010 000010DB 000010FD 0000118B
00000011 00001071 0000114D 0000123F
00000012 000012CF 0000126D 000011DB
00000013 0000140D 0000143D 0000141B
000002B7 0000143D
the r0=4 line
00000004 000005FF 0000069D 000006D1
thats a lot of cpu counts...
Hopefully I have put this topic to bed. While it is interesting to try to assume how fast code runs or how many counts, etc., it is not that simple on these types of processors: pipelines, caches, branch prediction, complicated system busses, and a common-ish core used in various chip implementations where the chip vendor manages the memory/flash separately from the processor IP vendor's code.
I didn't mess with branch prediction on this second experiment. Had I done that, alignment would not be so consistent. Depending on how branch prediction is implemented, its usefulness can vary based on where the branch is relative to the fetch line: the next fetch may or may not have started, or may be a certain way through, when the branch predictor determines it doesn't need that fetch and/or starts the branched fetch. In this case the branch target is only two instructions ahead, so you might not see it with this code; you would want some nops sprinkled in between so that the bmi destination is in a separate fetch line (in order to see the difference).
And this is the easy stuff to manipulate: using the same machine code sequences and seeing them vary in execution time. By how much? Between 0x3F and 0x6D1, which is over 27x difference between fastest and slowest...for the same machine code. Changing the alignment of the code by one instruction (because somewhere else, unrelated code has one more or one fewer instruction than in a prior build) made a 5 count difference.
to be fair the mrc at the end of the test was probably part of the time
000080c8 <COUNTER>:
80c8: ee192f1d mrc 15, 0, r2, cr9, cr13, {0}
80cc: ee193f1d mrc 15, 0, r3, cr9, cr13, {0}
80d0: e0430002 sub r0, r3, r2
80d4: e12fff1e bx lr
resulted in a count of 1 with either alignment. so doesnt guarantee that it was only one count of error in the measurement, but likely wasnt a dozen.
Anyway, I hope this helps your understanding.

I feel intuitively that 1 CPU cycle should be used for each instruction, so if we began with r0 = 4 it would take 11 CPU cycles to complete the code above. Is that correct?
Given that most ARM CPUs have 3-8 pipeline stages, it would be difficult to say that most instructions take 1 CPU cycle to complete. Ideally, in a pipelined CPU there should be one instruction retiring every clock cycle, but since the code above has branch instructions, it is difficult to judge when each instruction retires. The reason is that we don't know how the branches are dealt with, as this depends on the branch predictor algorithm present in the processor design. If the prediction is correct, no bubbles are inserted in the pipeline; if it is incorrect, the number of bubbles inserted depends on the internal pipeline structure. For an ideal 5-stage pipeline there would be 2 bubbles inserted for every misprediction. But again, this depends on the internal micro-architecture implementation.
As a result it would be difficult to accurately predict how many cycles the above code would take.
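To make the reasoning concrete, here is a rough sketch in C that walks the loop and counts executed instructions and taken branches, then applies a textbook estimate of one cycle per instruction plus a fixed penalty per taken branch. The names `count_delay` and `ideal_cycles` are invented here, and the model assumes a core with no branch prediction, so every taken branch pays the penalty; real cores, fetch buffers and memories are not this simple:

```c
#include <assert.h>
#include <stdint.h>

struct loop_stats {
    uint64_t instructions;   /* instructions executed, including bx lr */
    uint64_t taken_branches; /* branches that change the program counter */
};

/* Walks the delay loop and counts what executes:
 *   delay:     subs r0, #1
 *              bmi  end_delay
 *              b    delay
 *   end_delay: bx   lr
 */
struct loop_stats count_delay(uint32_t r0)
{
    struct loop_stats st = { 0, 0 };
    assert(r0 <= 1000000u);       /* keep the simulated walk short */
    for (;;) {
        r0 -= 1u;                 /* subs r0, #1 */
        st.instructions++;
        st.instructions++;        /* bmi end_delay */
        if (r0 & 0x80000000u) {   /* N flag set: branch taken */
            st.taken_branches++;
            break;
        }
        st.instructions++;        /* b delay (always taken) */
        st.taken_branches++;
    }
    st.instructions++;            /* bx lr */
    st.taken_branches++;
    return st;
}

/* Textbook estimate: one cycle per instruction plus 'penalty' cycles for
 * every taken branch. 'penalty' is a parameter of the model only. */
uint64_t ideal_cycles(struct loop_stats st, uint32_t penalty)
{
    return st.instructions + st.taken_branches * (uint64_t)penalty;
}
```

For r0 = 4 the walk executes 15 instructions, 6 of them taken branches, so even the one-cycle-per-instruction guess would be 15 rather than 11, and a 2-cycle penalty per taken branch raises the estimate to 27.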

Related

ELF Binary: why symbol value is different from actual symbol address? [duplicate]

readelf output of the object file:
Symbol table '.symtab' contains 15 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 FILE LOCAL DEFAULT ABS fp16.c
2: 00000000 0 SECTION LOCAL DEFAULT 1
3: 00000000 0 SECTION LOCAL DEFAULT 3
4: 00000000 0 SECTION LOCAL DEFAULT 4
5: 00000000 0 NOTYPE LOCAL DEFAULT 1 $t
6: 00000001 194 FUNC LOCAL DEFAULT 1 __gnu_f2h_internal
7: 00000010 0 NOTYPE LOCAL DEFAULT 5 $d
8: 00000000 0 SECTION LOCAL DEFAULT 5
9: 00000000 0 SECTION LOCAL DEFAULT 7
10: 000000c5 78 FUNC GLOBAL HIDDEN 1 __gnu_h2f_internal
11: 00000115 4 FUNC GLOBAL HIDDEN 1 __gnu_f2h_ieee
12: 00000119 4 FUNC GLOBAL HIDDEN 1 __gnu_h2f_ieee
13: 0000011d 4 FUNC GLOBAL HIDDEN 1 __gnu_f2h_alternative
14: 00000121 4 FUNC GLOBAL HIDDEN 1 __gnu_h2f_alternative
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000124 00 AX 0 0 4
[ 2] .rel.text REL 00000000 00058c 000010 08 9 1 4
[ 3] .data PROGBITS 00000000 000158 000000 00 WA 0 0 1
[ 4] .bss NOBITS 00000000 000158 000000 00 WA 0 0 1
[ 5] .debug_frame PROGBITS 00000000 000158 00008c 00 0 0 4
[ 6] .rel.debug_frame REL 00000000 00059c 000060 08 9 5 4
[ 7] .ARM.attributes ARM_ATTRIBUTES 00000000 0001e4 00002f 00 0 0 1
[ 8] .shstrtab STRTAB 00000000 000213 000051 00 0 0 1
[ 9] .symtab SYMTAB 00000000 00041c 0000f0 10 10 10 4
[10] .strtab STRTAB 00000000 00050c 00007e 00 0 0 1
Relocation section '.rel.text' at offset 0x58c contains 2 entries:
Offset Info Type Sym.Value Sym. Name
0000011a 00000a66 R_ARM_THM_JUMP11 000000c5 __gnu_h2f_internal
00000122 00000a66 R_ARM_THM_JUMP11 000000c5 __gnu_h2f_internal
Relocation section '.rel.debug_frame' at offset 0x59c contains 12 entries:
Offset Info Type Sym.Value Sym. Name
00000014 00000802 R_ARM_ABS32 00000000 .debug_frame
00000018 00000202 R_ARM_ABS32 00000000 .text
00000040 00000802 R_ARM_ABS32 00000000 .debug_frame
00000044 00000202 R_ARM_ABS32 00000000 .text
00000050 00000802 R_ARM_ABS32 00000000 .debug_frame
00000054 00000202 R_ARM_ABS32 00000000 .text
00000060 00000802 R_ARM_ABS32 00000000 .debug_frame
00000064 00000202 R_ARM_ABS32 00000000 .text
00000070 00000802 R_ARM_ABS32 00000000 .debug_frame
00000074 00000202 R_ARM_ABS32 00000000 .text
00000080 00000802 R_ARM_ABS32 00000000 .debug_frame
00000084 00000202 R_ARM_ABS32 00000000 .text
.text section structure as I understand it:
.text section has size of 0x124
0x0: unknown byte
0x1-0xC3: __gnu_f2h_internal
0xC3-0xC5: two unknown bytes between those functions (btw what are those?)
0xC5-0x113: __gnu_h2f_internal
0x113-0x115: two unknown bytes between those functions
0x115-0x119: __gnu_f2h_ieee
0x119-0x11D: __gnu_h2f_ieee
0x11D-0x121: __gnu_f2h_alternative
0x121-0x125: __gnu_h2f_alternative // section is only 0x124, what happened to the missing byte?
Notice that the section size is 0x124 and the last function ends at 0x125. What happened to the missing byte?
Thanks.
Technically, your "missing byte" is the one right there at 0x0.
Note that you're looking at the value of the symbol, i.e. the runtime function address (this would be a lot clearer if your .text section VMA wasn't 0). Since they're Thumb functions, the addresses have bit 0 set such that the processor will switch to Thumb mode when calling them; the actual locations of those instructions are still halfword-aligned, i.e. 0x0, 0xc4, 0x114, etc. since they couldn't be executed otherwise (you'd take a fault for a misaligned PC). Strip off bit 0 as per what the ARM ELF spec says about STT_FUNC symbols to get the actual VMA of the instruction corresponding to that symbol, then subtract the start of the section and you should have the same relative offset as within the object file itself.
<offset in section> = (<symbol value> & ~1) - <section VMA>
The extra halfword padding after some functions just ensures each symbol is word-aligned - there are probably various reasons for this, but the first one that comes to mind is that the adr instruction wouldn't work properly if they weren't.
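The formula above can be written as a few lines of C. This is only a sketch; `offset_in_section` is a name chosen for illustration, and it assumes the symbol is an STT_FUNC symbol, where bit 0 is the Thumb flag:

```c
#include <assert.h>
#include <stdint.h>

/* Offset of a function's first instruction within its section, given the
 * symbol value from the symbol table and the section's VMA. For Thumb
 * functions bit 0 of the symbol value is a mode flag, not part of the
 * address, so it is cleared first. */
uint32_t offset_in_section(uint32_t symbol_value, uint32_t section_vma)
{
    uint32_t addr = symbol_value & ~UINT32_C(1);
    assert(addr >= section_vma); /* symbol must not precede its section */
    return addr - section_vma;
}
```

With the values from the symbol table above (.text VMA 0), __gnu_f2h_internal (value 0x1) starts at offset 0x0, __gnu_h2f_internal (0xc5) at 0xc4, and __gnu_h2f_alternative (0x121, size 4) at 0x120, so it ends at 0x124, which matches the section size.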

Seg Fault in ARM Assembly

So, I am trying to learn ARM assembly and basically what I want to do is turn on the LEDs of my BeagleBone Black using pure assembly. I know how to program in C very well, but I am new to ARM assembly if that makes any difference.
Basically I am just trying to modify a character in a string, but it doesn't seem to be working. Maybe it is because I do not fully understand the memory management instructions.
When I run the code it gives me a segmentation fault.
Here is my code:
.syntax unified
.global main
main:
push {ip, lr}
mov r0, beagle_bone_0
mov r1, #0x65
strb r1, [r0]
ldr r0, =beagle_bone_0
bl printf
pop {ip, pc}
beagle_bone_0:
.asciz "/sys/class/leds/beaglebone:green:usr0/brightness"
objdump -x output:
helloworld: file format elf32-littlearm
helloworld
architecture: arm, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x00008325
Program Header:
0x70000001 off 0x00000444 vaddr 0x00008444 paddr 0x00008444 align 2**2
filesz 0x00000008 memsz 0x00000008 flags r--
PHDR off 0x00000034 vaddr 0x00008034 paddr 0x00008034 align 2**2
filesz 0x00000100 memsz 0x00000100 flags r-x
INTERP off 0x00000134 vaddr 0x00008134 paddr 0x00008134 align 2**0
filesz 0x00000019 memsz 0x00000019 flags r--
LOAD off 0x00000000 vaddr 0x00008000 paddr 0x00008000 align 2**15
filesz 0x00000450 memsz 0x00000450 flags r-x
LOAD off 0x00000450 vaddr 0x00010450 paddr 0x00010450 align 2**15
filesz 0x00000124 memsz 0x00000128 flags rw-
DYNAMIC off 0x0000045c vaddr 0x0001045c paddr 0x0001045c align 2**2
filesz 0x000000f0 memsz 0x000000f0 flags rw-
NOTE off 0x00000150 vaddr 0x00008150 paddr 0x00008150 align 2**2
filesz 0x00000044 memsz 0x00000044 flags r--
STACK off 0x00000000 vaddr 0x00000000 paddr 0x00000000 align 2**2
filesz 0x00000000 memsz 0x00000000 flags rwx
Dynamic Section:
NEEDED libc.so.6
INIT 0x000082d1
FINI 0x00008439
INIT_ARRAY 0x00010450
INIT_ARRAYSZ 0x00000004
FINI_ARRAY 0x00010454
FINI_ARRAYSZ 0x00000004
HASH 0x00008194
GNU_HASH 0x000081bc
STRTAB 0x00008238
SYMTAB 0x000081e8
STRSZ 0x00000043
SYMENT 0x00000010
DEBUG 0x00000000
PLTGOT 0x0001054c
PLTRELSZ 0x00000020
PLTREL 0x00000011
JMPREL 0x000082b0
REL 0x000082a8
RELSZ 0x00000008
RELENT 0x00000008
VERNEED 0x00008288
VERNEEDNUM 0x00000001
VERSYM 0x0000827c
Version References:
required from libc.so.6:
0x0d696914 0x00 02 GLIBC_2.4
private flags = 5000002: [Version5 EABI] [has entry point]
Sections:
Idx Name Size VMA LMA File off Algn
0 .interp 00000019 00008134 00008134 00000134 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .note.ABI-tag 00000020 00008150 00008150 00000150 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .note.gnu.build-id 00000024 00008170 00008170 00000170 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .hash 00000028 00008194 00008194 00000194 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .gnu.hash 0000002c 000081bc 000081bc 000001bc 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 .dynsym 00000050 000081e8 000081e8 000001e8 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 .dynstr 00000043 00008238 00008238 00000238 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
7 .gnu.version 0000000a 0000827c 0000827c 0000027c 2**1
CONTENTS, ALLOC, LOAD, READONLY, DATA
8 .gnu.version_r 00000020 00008288 00008288 00000288 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
9 .rel.dyn 00000008 000082a8 000082a8 000002a8 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
10 .rel.plt 00000020 000082b0 000082b0 000002b0 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
11 .init 0000000a 000082d0 000082d0 000002d0 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
12 .plt 00000048 000082dc 000082dc 000002dc 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
13 .text 00000114 00008324 00008324 00000324 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
14 .fini 00000006 00008438 00008438 00000438 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
15 .rodata 00000004 00008440 00008440 00000440 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
16 .ARM.exidx 00000008 00008444 00008444 00000444 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
17 .eh_frame 00000004 0000844c 0000844c 0000044c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
18 .init_array 00000004 00010450 00010450 00000450 2**2
CONTENTS, ALLOC, LOAD, DATA
19 .fini_array 00000004 00010454 00010454 00000454 2**2
CONTENTS, ALLOC, LOAD, DATA
20 .jcr 00000004 00010458 00010458 00000458 2**2
CONTENTS, ALLOC, LOAD, DATA
21 .dynamic 000000f0 0001045c 0001045c 0000045c 2**2
CONTENTS, ALLOC, LOAD, DATA
22 .got 00000020 0001054c 0001054c 0000054c 2**2
CONTENTS, ALLOC, LOAD, DATA
23 .data 00000008 0001056c 0001056c 0000056c 2**2
CONTENTS, ALLOC, LOAD, DATA
24 .bss 00000004 00010574 00010574 00000574 2**0
ALLOC
25 .comment 0000001d 00000000 00000000 00000574 2**0
CONTENTS, READONLY
26 .ARM.attributes 00000031 00000000 00000000 00000591 2**0
CONTENTS, READONLY
SYMBOL TABLE:
00008134 l d .interp 00000000 .interp
00008150 l d .note.ABI-tag 00000000 .note.ABI-tag
00008170 l d .note.gnu.build-id 00000000 .note.gnu.build-id
00008194 l d .hash 00000000 .hash
000081bc l d .gnu.hash 00000000 .gnu.hash
000081e8 l d .dynsym 00000000 .dynsym
00008238 l d .dynstr 00000000 .dynstr
0000827c l d .gnu.version 00000000 .gnu.version
00008288 l d .gnu.version_r 00000000 .gnu.version_r
000082a8 l d .rel.dyn 00000000 .rel.dyn
000082b0 l d .rel.plt 00000000 .rel.plt
000082d0 l d .init 00000000 .init
000082dc l d .plt 00000000 .plt
00008324 l d .text 00000000 .text
00008438 l d .fini 00000000 .fini
00008440 l d .rodata 00000000 .rodata
00008444 l d .ARM.exidx 00000000 .ARM.exidx
0000844c l d .eh_frame 00000000 .eh_frame
00010450 l d .init_array 00000000 .init_array
00010454 l d .fini_array 00000000 .fini_array
00010458 l d .jcr 00000000 .jcr
0001045c l d .dynamic 00000000 .dynamic
0001054c l d .got 00000000 .got
0001056c l d .data 00000000 .data
00010574 l d .bss 00000000 .bss
00000000 l d .comment 00000000 .comment
00000000 l d .ARM.attributes 00000000 .ARM.attributes
0000835c l F .text 00000000 call_gmon_start
00000000 l df *ABS* 00000000 crtstuff.c
00010458 l O .jcr 00000000 __JCR_LIST__
00008374 l F .text 00000000 __do_global_dtors_aux
00010574 l O .bss 00000001 completed.5637
00010454 l O .fini_array 00000000 __do_global_dtors_aux_fini_array_entry
00008384 l F .text 00000000 frame_dummy
00010450 l O .init_array 00000000 __frame_dummy_init_array_entry
000083b8 l .text 00000000 beagle_bone_0
00000000 l df *ABS* 00000000 crtstuff.c
0000844c l O .eh_frame 00000000 __FRAME_END__
00010458 l O .jcr 00000000 __JCR_END__
00010454 l .init_array 00000000 __init_array_end
0001045c l O .dynamic 00000000 _DYNAMIC
00010450 l .init_array 00000000 __init_array_start
0001054c l O .got 00000000 _GLOBAL_OFFSET_TABLE_
00008434 g F .text 00000002 __libc_csu_fini
0001056c w .data 00000000 data_start
000082f0 F *UND* 00000000 printf##GLIBC_2.4
00010574 g *ABS* 00000000 __bss_start__
00010578 g *ABS* 00000000 _bss_end__
00010574 g *ABS* 00000000 _edata
00008438 g F .fini 00000000 _fini
00010578 g *ABS* 00000000 __bss_end__
0001056c g .data 00000000 __data_start
000082fc F *UND* 00000000 __libc_start_main##GLIBC_2.4
00000000 w *UND* 00000000 __gmon_start__
00010570 g O .data 00000000 .hidden __dso_handle
00008440 g O .rodata 00000004 _IO_stdin_used
000083f0 g F .text 00000044 __libc_csu_init
00010578 g *ABS* 00000000 _end
00008324 g F .text 00000000 _start
00010578 g *ABS* 00000000 __end__
00010574 g *ABS* 00000000 __bss_start
0000839c g .text 00000000 main
00000000 w *UND* 00000000 _Jv_RegisterClasses
00008318 F *UND* 00000000 abort##GLIBC_2.4
000082d0 g F .init 00000000 _init
The answer to my question was actually really simple. Since ldr r0, =beagle_bone_0 loads the address of beagle_bone_0 into r0, I can modify the string through that address.
Working test code:
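.syntax unified
.data
beagle_bone_0: .asciz "Hello, world\n"  @ .asciz appends the NUL terminator printf needs
.text
.global main
main:
push {ip, lr}            @ save lr; ip is pushed only to keep the stack 8-byte aligned
ldr r0, =beagle_bone_0   @ r0 = address of the string (also printf's first argument)
mov r1, #0x65            @ ASCII 'e'
strb r1, [r0]            @ overwrite the first byte of the string
bl printf
pop {ip, pc}             @ restore and return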
I ran and debugged your code. The line mov r0, beagle_bone_0 didn't even assemble (with my toolchain, at least). What you want is to load the address of beagle_bone_0 into r0. For a label in the same section you can use the adr pseudo-instruction, which the assembler translates into a pc-relative add or sub (something like add r0, pc, #8). mov cannot take a label as its operand this way; your assembler probably translated it into something different.
So, to fix it, just replace the line mov r0, beagle_bone_0 with adr r0, beagle_bone_0.
Also, the string was in the .text section, which is read-only at run time, so I put beagle_bone_0 in the .data section. Note that adr only reaches labels in the same section as the instruction, so once the string lives in .data its address has to be loaded with ldr r0, =beagle_bone_0 instead.

What should the value of %esp be at this point in the code?

I've been having trouble getting this code to work.
test $0x10000000, %esp
jz .ERROR
ret
If it jumps to .ERROR, the code just exits. Otherwise the output prints as normal.
When I use test $0x0000000, %esp it quits as I would expect.
These are my sections:
Sections:
Idx Name Size VMA LMA File off Algn
0 .interp 00000013 08048114 08048114 00000114 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .note.ABI-tag 00000020 08048128 08048128 00000128 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .hash 00000038 08048148 08048148 00000148 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .dynsym 00000090 08048180 08048180 00000180 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .dynstr 00000064 08048210 08048210 00000210 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 .gnu.version 00000012 08048274 08048274 00000274 2**1
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 .gnu.version_r 00000020 08048288 08048288 00000288 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
7 .rel.dyn 00000010 080482a8 080482a8 000002a8 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
8 .rel.plt 00000030 080482b8 080482b8 000002b8 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
9 .init 00000024 080482e8 080482e8 000002e8 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
10 .plt 00000070 08048310 08048310 00000310 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
11 .text 00000188 08048380 08048380 00000380 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
12 .springboard 00000023 08048508 08048508 00000508 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
13 .fini 00000015 0804852c 0804852c 0000052c 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
14 .rodata 00000024 08048544 08048544 00000544 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
15 .eh_frame 000000e0 08048568 08048568 00000568 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
16 .dynamic 000000c8 08049648 08049648 00000648 2**2
CONTENTS, ALLOC, LOAD, DATA
17 .got 00000004 08049710 08049710 00000710 2**2
CONTENTS, ALLOC, LOAD, DATA
18 .got.plt 00000024 08049714 08049714 00000714 2**2
CONTENTS, ALLOC, LOAD, DATA
19 .data 00000004 08049738 08049738 00000738 2**2
CONTENTS, ALLOC, LOAD, DATA
20 .bss 00000004 0804973c 0804973c 0000073c 2**2
ALLOC
21 .comment 0000002a 00000000 00000000 0000073c 2**0
CONTENTS, READONLY
Maybe I don't understand this yet, but should %esp be equal to addresses in that range?
I can move the .springboard section to 0x10000000 if I link it with my linker script. The return goes to the springboard section. So my thought was that it shouldn't work here, but if I link it with my script and the springboard section is moved, then it will work. Why is it working in both cases?
I'm guessing the test is returning a non-zero value but I don't understand why.
No, esp is the stack pointer, so it should point to some address inside the stack. Your program doesn't seem to provide any stack section of its own, so I guess the OS allocates the stack.
If you are about to return from a function, though, dword ptr [esp] (not esp itself) should indeed contain an address from the sections above, since it holds the return address: the address of the next instruction to execute after the call.

Converting ARM to C

Given, for example, the following ARM assembly code, are there any straightforward ways to convert it directly to C, using whatever appropriate variable names?
ADD $2 $0 #9
ADD $3 $0 #3
ADD $1 $0 $0
loop: ADD $1 $1 #1
ADD $3 $0 $3, LSL #1
SUB $2 $2 $1
CMP $2 $1
BNE loop
Also, as I'm still learning ARM, how many times will the loop execute say, SUB or ADD? Are there straightforward ways to determine this?
Thanks for the help! Any other insight not particularly aimed at answering the question would also be great.
In short, BNE (Branch if Not Equal) at the bottom of a loop could suggest either a do{...}while loop or the other way around, a while (...){...} loop, or possibly even a for( ...; ... < ....; ...){...} loop; that's about as far as the instruction alone can take you.
As for the additions and subtractions on registers (read: variables, in the context of C), you will have to read through them and come up with a near equivalent.
A decompiler may not help you at this stage. Write a few small pieces of C code for practice, compile them to assembly language by passing the -S option to the C compiler, and see what you get. It is mostly trial and error, I'm afraid, if you're looking for an exact replica of the code in the question.
unsigned int r0,r1,r2,r3;
r2=r0+9;
r3=r0+3;
r1=r0+r0;
do
{
r1=r1+1;
r3=r0+(r3<<1);
r2=r2-r1;
} while(r2!=r1);
Without knowing what r0 is going in, the loop can run a few times or many times (millions? billions?). r2 is decreasing and r1 is increasing; if they don't land on equal values the first time they pass each other, they will have to wrap around. Every pass r1 gets bigger, so r2 gets smaller that much faster. It should be very easy to add a printf, try some test values for r0, and see what happens.
Say, for example, r0 is 0 before entering this code: r2 is r0+9 = 9, and r1 is double r0, which is 0.
The first few passes would go like this, with the four columns being r0, r1, r2, r3 in hex. (Note that in these traces r2 drops by 1 per pass, i.e. they treat the SUB as r2 = r2 - 1; with r2 = r2 - r1 as in the C code above, r2 falls faster and the values differ.)
00000000 00000001 00000008 00000006
00000000 00000002 00000007 0000000C
00000000 00000003 00000006 00000018
00000000 00000004 00000005 00000030
00000000 00000005 00000004 00000060
00000000 00000006 00000003 000000C0
00000000 00000007 00000002 00000180
00000000 00000008 00000001 00000300
00000000 00000009 00000000 00000600
00000000 0000000A FFFFFFFF 00000C00
00000000 0000000B FFFFFFFE 00001800
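r2 and r1 are not going to collide: in this trace the gap between them starts at 7 and shrinks by 2 each pass, so it stays odd and skips zero.
But if r0 is 1 going in, then: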
00000001 00000003 00000009 00000009
00000001 00000004 00000008 00000013
00000001 00000005 00000007 00000027
00000001 00000006 00000006 0000004F
And with r0 = 3:
00000003 00000007 0000000B 0000000F
00000003 00000008 0000000A 00000021
00000003 00000009 00000009 00000045
So far r0 needs to be odd. But when you make r0 9, r1 is already past r2 after the first pass, and they only move further apart:
00000009 00000013 00000011 00000021
00000009 00000014 00000010 0000004B
00000009 00000015 0000000F 0000009F
00000009 00000016 0000000E 00000147
00000009 00000017 0000000D 00000297
00000009 00000018 0000000C 00000537
00000009 00000019 0000000B 00000A77
00000009 0000001A 0000000A 000014F7
00000009 0000001B 00000009 000029F7
00000009 0000001C 00000008 000053F7
00000009 0000001D 00000007 0000A7F7
00000009 0000001E 00000006 00014FF7
00000009 0000001F 00000005 00029FF7
00000009 00000020 00000004 00053FF7
00000009 00000021 00000003 000A7FF7
00000009 00000022 00000002 0014FFF7
00000009 00000023 00000001 0029FFF7
00000009 00000024 00000000 0053FFF7
00000009 00000025 FFFFFFFF 00A7FFF7
00000009 00000026 FFFFFFFE 014FFFF7
Basically it is somewhat deterministic and follows some rules, but if the two values never compare equal, the loop may run forever, or at least for a great many passes.

Resources