Are Assembly programs almost the same size as C programs - c

For example: I created a simple C program that prints "Hello, World", compiled it and it created an executable that had a size of 39.8Kb.
following this question I was able to create the equivalent but written in Assembly the size of this program was 39.6Kb.
This surprised me greatly as I expected the assembly program to be smaller than the C program. As the question indicated it uses a C header and the gcc compiler. Would this make the assembly program bigger or is it normal for them to be both roughly the same size?
Using the strip command I reduced both files. This removed debug code and now both have very similar file sizes. Both 18.5Kb.
test.c:

If your hand written code is on par with a compiled function, then sure they are going to be similar in size, they are doing the same thing and if you can compete with a compiler you will be the same or similar.
Now your file sizes indicate you are looking at the wrong thing all together. The file you are looking at while called a binary has a ton of other stuff in it. You want to compare apples to apples in this context then compare the size of the functions, the machine code, not the size of the container that holds the functions plus debug info plus strings plus a number of other things.
Your experiment is flawed but the results very loosely indicate the expected result. But that is if you are producing code in the same way. The odds of that are slim so saying that no you shouldnt expect similar results unless you are producing code in the same way.
take this simple function
unsigned int fun ( unsigned int a, unsigned int b)
{
return(a+b+1);
}
the same compiler produced this:
00000000 <fun>:
0: e52db004 push {r11} ; (str r11, [sp, #-4]!)
4: e28db000 add r11, sp, #0
8: e24dd00c sub sp, sp, #12
c: e50b0008 str r0, [r11, #-8]
10: e50b100c str r1, [r11, #-12]
14: e51b2008 ldr r2, [r11, #-8]
18: e51b300c ldr r3, [r11, #-12]
1c: e0823003 add r3, r2, r3
20: e2833001 add r3, r3, #1
24: e1a00003 mov r0, r3
28: e28bd000 add sp, r11, #0
2c: e49db004 pop {r11} ; (ldr r11, [sp], #4)
30: e12fff1e bx lr
and this
00000000 <fun>:
0: e2811001 add r1, r1, #1
4: e0810000 add r0, r1, r0
8: e12fff1e bx lr
because of different settings. 13 instructions vs 3, over 4 times larger.
A human might generate this directly from the C, nothing fancy
add r0,r0,r1
add r0,r0,#1
bx lr
not sure from order of operations if you technically have to add the one to b before adding that sum to a. Or if it doesnt matter. I went left to right the compiler went right to left.
so you could say that the compiler and my assembly produced the same number of bytes of binary, or you could say that the compiler produced something over 4 times larger.
Take the above and expand that into a real program that does useful things.
Exercise to the reader (the OP, please dont spoil it) to figure out why the compiler can produce two different correct solutions that are so different in size.
EDIT
.exe, elf and other "binary" formats as mentioned can contain debug information, ascii strings that contain names of functions/labels that make for pretty debug screens. Which are part of the "binary" in that they are part of the baggage but are not machine code nor data used when executing that program, at least not the stuff I am mentioning. You can without changing the machine code nor data the program needs, manipulate the size of your .exe or other file format using compiler settings, so the same compiler-assembler-linker or assembler-linker path can make the binary file in some senses of that word larger or smaller by including or not this additional baggage. So that is part of understanding file sizes and why perhaps even if your hello world programs were different sizes, the overall file might be around the same size, if one is 10 bytes longer but the .exe is 40K then that 10 bytes is in the noise. But if I understand your question, that 10 bytes is what you are interested in knowing how it compares between compiled and hand written C.
Also note that compilers are made by humans, so the output they produce is on par with what at least those humans can produce, other humans can do better, many do worse depending on your definition of better and worse.

the size 39+ Kb absolute not related to compiler and language used (c/c++ or asm) different optimizations, debug information, etc - can change size of this tinny code on say 1000 bytes. but not more. i for test build next program
#include <Windows.h>
#include <stdio.h>
void ep(void*)
{
ExitProcess(printf("Hello, World"));
}
linker options:
/INCREMENTAL:NO /NOLOGO /MANIFEST:NO /NODEFAULTLIB
/SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /ENTRY:"ep" /MACHINE:X64 kernel32.lib msvcrt.lib
and got size 2560 bytes exe for both x86/x64.
in what different ? in /NODEFAULTLIB and my version of msvcrt.lib - which is pure import library.
the rest 35kb+ size you give by used static linked c runtime. even if you write program on asm - you need use some lib for link to printf. and your lib containing some code which is static linked with your code. in this code this 35kb.
task is not c++ vs asm - no different here. task in use c-runtime or not use

I agree with old_time but I also did a quick test for ground truth. With VS-2017 Pro, I get similar results (~37KB) on the size of the executable, but only if I look in the debug output folder. After building for release, it's closer to ~9KB. Much of that difference is in the size of the static libraries needed to call into the OS/C-runtime DLL's.
EDIT: Despite the fact that most modern C compilers can match or out-perform most hand written assembly code, the hand written variety can be smaller by virtue of the fact that it doesn't have to have all that C run-time over-head, but the difference is rarely enough to warrant the extra development and maintenance costs of assembler code, particularly for non-trivial applications. There's a reason that most of the modern OS kernels are written predominantly in C or other high-level languages with only pin-hole assembler optimizations in a handful of critical functions.
Trivial "hello world" class programs are not a good comparison for C vs assembler. There's just not enough opportunities for the compiler or the human to do much in the way of optimization. Write a math or data processing library and application and compare those. I'd be willing to bet the compiler will kick your but.

Related

What would be a reason to use ADDS instruction instead of the ADD instruction in ARM assembly?

My course notes always use ADDS and SUBS in their ARM code snippets, instead of the ADD and SUB as I would expect. Here's one such snippet for example:
__asm void my_capitalize(char *str)
{
cap_loop
LDRB r1, [r0] // Load byte into r1 from memory pointed to by r0 (str pointer)
CMP r1, #'a'-1 // compare it with the character before 'a'
BLS cap_skip // If byte is lower or same, then skip this byte
CMP r1, #'z' // Compare it with the 'z' character
BHI cap_skip // If it is higher, then skip this byte
SUBS r1,#32 // Else subtract out difference to capitalize it
STRB r1, [r0] // Store the capitalized byte back in memory
cap_skip
ADDS r0, r0, #1 // Increment str pointer
CMP r1, #0 // Was the byte 0?
BNE cap_loop // If not, repeat the loop
BX lr // Else return from subroutine
}
This simple code for example converts all lowercase English in a string to uppercase. What I do not understand in this code is why they are not using ADD and SUB commands instead of ADDS and SUBS currently being used. The ADDS and SUBS command, afaik, update the APSR flags NZCV, for later use. However, as you can see in the above snippet, the updated values are not being utilized. Is there any other utility of this command then?
Arithmetic instructions (ADD, SUB, etc) don't modify the status flag, unlike comparison instructions (CMP,TEQ) which update the condition flags by default. However, adding the S to the arithmetic instructions(ADDS, SUBS, etc) will update the condition flags according to the result of the operation. That is the only point of using the S for the arithmetic instructions, so if the cf are not going to be checked, there is no reason to use ADDS instead of ADD.
There are more codes to append to the instruction (link), in order to achieve different purposes, such as CC (the conditional flag C=0), hence:
ADDCC: do the operation if the carry status bit is set to 0.
ADDCCS: do the operation if the carry status bit is set to 0 and afterwards, update the status flags (if C=1, the status flags are not overwritten).
From the cycles point of view, there is no difference between updating the conditional flags or not. Considering an ARMv6-M as example, ADDS and ADD will take 1 cycle.
Discard the use of ADD might look like a lazy choice, since ADD is quite useful for some cases. Going further, consider these examples:
SUBS r0, r0, #1
ADDS r0, r0, #2
BNE go_wherever
and
SUBS r0, r0, #1
ADD r0, r0, #2
BNE go_wherever
may yield different behaviours.
As old_timer has pointed out, the UAL becomes quite relevant on this topic. Talking about the unified language, the preferred syntax is ADDS, instead of ADD (link). So the OP's code is absolutely fine (even recommended) if the purpose is to be assembled for Thumb and/or ARM (using UAL).
ADD without the flag update is not available on some cortex-ms. If you look at the arm documentation for the instruction set (always a good idea when doing assembly language) for general purpose use cases that is not available until a thumb2 extension on armv7-m (cortex-m3, cortex-m4, cortex-m7). The cortex-m0 and cortex-m0+ and generally wide compatibility code (which would use armv4t or armv6-m) doesn't have an add without flags option. So perhaps that is why.
The other reason may be to get the 16-bit instruction not the 32, but but that is a slippery slope as it gets even more into assemblers and their syntax (syntax is defined by the assembler, the program that processes assembly language, not the target). For example not syntax unified gas:
.thumb
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
The disassembler knows reality but the assembler doesn't:
so.s: Assembler messages:
so.s:2: Error: instruction not supported in Thumb16 mode -- `adds r1,r2,r3'
but
.syntax unified
.thumb
adds r1,r2,r3
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: eb02 0103 add.w r1, r2, r3
So not slippery in this case, but with the unified syntax you start to get into blahw, blah.w, blah, type syntax and have to spin back around to check to see that the instructions you wanted are being generated. Non-unified has its own games as well, and of course all of this is assembler-specific.
I suspect they were either going with the only choice they had, or were using the smaller and more compatible instruction, especially if this were a class or text, the more compatible the better.

Why do I get the same address every time I build + disassemble a function inside GDB?

Every time when I disassemble a function, why do I always get the same instruction address and constants' address?
For example, after executing the following commands,
gcc -o hello hello.c -ggdb
gdb hello
(gdb) disassemble main
the dump code would be:
When I quit gdb and re-disassemble the main function, I will get the same result as before. The instruction address and even the address of constants are always the same for each disassemble command in gdb. Why is that? Does the compiled file hello contain certain information about the address of each assembly instruction as well as the constants' addresses?
If you made a position-independent executable (e.g. with gcc -fpie -pie, which is the default for gcc in many recent Linux distros), the kernel would randomize the address it mapped your executable at. (Except when running under GDB: GDB disables ASLR by default even for shared libraries, and for PIE executables.)
But you're making a position-dependent executable, which can take advantage of static addresses being link-time constants (by using them as immediates and so on without needing runtime relocation fixups). e.g. you or the compiler can use mov $msg, %edi (like your code) instead of lea msg, %rdi (with -fpie).
Regular (position-dependent) executables have their load-address set in the ELF headers: use readelf -a ./a.out to see the ELF metadata.
A non-PIE executable will load at the same time every time even without running it under GDB, at the address specified in the ELF program headers.
(gcc / ld chooses 0x400000 by default on x86-64-linux-elf; you could change this with a linker script). Relocation information for all the static addresses hard-coded into the code + data is not available, so the loader couldn't fix up the addresses even if it wanted to.
e.g. in a simple executable (with only a text segment, not data or bss) I built with -no-pie (which seems to be the default in your gcc):
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000c5 0x00000000000000c5 R E 0x200000
Section to Segment mapping:
Segment Sections...
00 .text
So the ELF headers request that offset 0 in the file be mapped to virtual address 0x0000000000400000. (And the ELF entry point is 0x400080; that's where _start is.) I'm not sure what the relevance of PhysAddr = VirtAddr is; user-space executables don't know and can't easily find out what physical addresses the kernel used for pages of RAM backing their virtual memory, and it can change at any time as pages are swapped in / out.
Note that readelf does line wrapping; note there are two rows of columns headers. The 0x200000 is the Align column for that one LOADed segment.
By default, the GNU toolchain for x86-64 Linux produces position-dependent executables which are mapped at address 0x400000. (position-independent executables will be mapped at 0x55… addresses instead). It is possible to change that by building GCC --enable-default-pie, or by specifying compiler and linker flags.
However, even for a position-independent executable (PIE), the addresses would be constant between GDB runs because GDB disables address space layout randomization by default. GDB does this so that breakpoints at absolute addresses can be re-applied after the program has been started.
There are a variety of executable file formats. Typically, an executable file contains information anout several memory sections or segments. Inside the executable, references to memory addresses may be expressed relative to the beginning of a section. The executable also contains a relocation table. The relocation table is a list of those references, including where each one is in the executable, what section it refers to, and what type of reference it is (what field of an instruction it is used in, etc.).
The loader (software that loads your program into memory) reads the executable and writes the sections to memory. In your case, the loader appears to be using the same base addresses for sections every time it runs. After initially putting the sections in memory, the loader reads the relocation table and uses it to fix up all the references to memory by adjusting them based on where each section was loaded into memory. For example, the compiler may write an instruction as, in effect, “Load register 3 from the start of the data section plus 278 bytes.” If the loader puts the data section at address 2000, it will adjust this instruction to use the sum of 2000 and 278, making “Load register 3 from address 2278.”
Good modern loaders randomize where sections are loaded. They do this because malicious people are sometimes able to exploit bugs in programs to cause them to execute code injected by the attacker. Randomizing section locations prevents the attacker from knowing the address where their code will be injected, which can hinder their ability to prepare the code to be injected. Since your addresses are not changing, it appears your loader does not do this. You may be using an older system.
Some processor architectures and/or loaders support position independent code (PIC). In this case, the form of an instruction may be “Load register 3 from 694 bytes beyond where this instruction is.” In that case, as long as the data is always at the same distance from the instruction, it does not matter where they are in memory. When the process executes the instruction, it will add the address of the instruction to 694, and that will be the address of the data. Another way of implementing PIC-like code is for the loader to provide the addresses of each section to the program, by putting those addresses in registers or fixed locations in memory. Then the program can use those base addresses to do its own address calculations. Since your program has an address built into the code, it does not appear your program is using these methods.
a not intended to be really executed program
bootstrap
.globl _start
_start:
bl one
b .
first c file
extern unsigned int hello;
unsigned int one ( void )
{
return(hello+5);
}
second c file (being extern forces the compiler to compile the first object in a certain way)
unsigned int hello;
linker script
MEMORY
{
ram : ORIGIN = 0x00001000, LENGTH = 0x4000
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > ram
}
building position dependent
Disassembly of section .text:
00001000 <_start>:
1000: eb000000 bl 1008 <one>
1004: eafffffe b 1004 <_start+0x4>
00001008 <one>:
1008: e59f3008 ldr r3, [pc, #8] ; 1018 <one+0x10>
100c: e5930000 ldr r0, [r3]
1010: e2800005 add r0, r0, #5
1014: e12fff1e bx lr
1018: 0000101c andeq r1, r0, r12, lsl r0
Disassembly of section .bss:
0000101c <hello>:
101c: 00000000 andeq r0, r0, r0
the key here is at address 0x1018 the compiler had to leave a placeholder for the address to the external item. shown as offset 0x10 below
00000000 <one>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <one+0x10>
4: e5930000 ldr r0, [r3]
8: e2800005 add r0, r0, #5
c: e12fff1e bx lr
10: 00000000 andeq r0, r0, r0
The linker fills this in at link time. You can see in the disassembly above that position dependent it fills in the absolute address of where to find that item. For this code to work the code must be loaded in a way that that item shows up at that address. It has to be loaded at a specific position or address in memory. Position dependent. (loaded at address 0x1000 basically).
If your toolchain supports position independent (gnu does) then this represents a solution.
Disassembly of section .text:
00001000 <_start>:
1000: eb000000 bl 1008 <one>
1004: eafffffe b 1004 <_start+0x4>
00001008 <one>:
1008: e59f3014 ldr r3, [pc, #20] ; 1024 <one+0x1c>
100c: e59f2014 ldr r2, [pc, #20] ; 1028 <one+0x20>
1010: e08f3003 add r3, pc, r3
1014: e7933002 ldr r3, [r3, r2]
1018: e5930000 ldr r0, [r3]
101c: e2800005 add r0, r0, #5
1020: e12fff1e bx lr
1024: 00000014 andeq r0, r0, r4, lsl r0
1028: 00000000 andeq r0, r0, r0
Disassembly of section .got:
0000102c <.got>:
102c: 0000103c andeq r1, r0, r12, lsr r0
Disassembly of section .got.plt:
00001030 <_GLOBAL_OFFSET_TABLE_>:
...
Disassembly of section .bss:
0000103c <hello>:
103c: 00000000 andeq r0, r0, r0
It has a performance hit of course, but instead of the compiler and linker working together by leaving one location, there is now a table, global offset table (for this solution) that is at a known location which is position relative to the code, that contains linker supplied offsets.
The program is not position independent yet, it will certainly not work if you load it anywhere. The loader has to patch up the table/solution based on where it wants to place the items. This is far simpler than having a very long list of each of the locations to patch in the first solution, although that would have been a very possible way to do it. A table in the executable (executables contain more than the program and data they have other items of information as you know if you objdump or readelf an elf file) could contain all of those offsets and the loader could patch those up too.
If your data and bss and other memory sections are fixed relative to .text as I have built here, then a got wasnt necessary the linker could have at link time computed the relative offset to the resource and along with the compiler found the item in an position independent way, and the binary could have been loaded just about anywhere (some minimum alignment may hav been required) and it would work without any patching. With the gnu solution I think you can move the segments relative to each other.
It is incorrect to state that the kernel will or would always randomize your location if built position independent. While possible so long as the toolchain and the loader from the operating system (a completely separate development) work hand in hand, the loader has the opportunity. But that does not in any way mean that every loader does or will. Specific operating systems/distros/versions may have that set as a default yes. If they come across a binary that is position independent (built in a way that loader expects). It is like saying all mechanics on the planet will use a specific brand and type of oil if you show up in their garage with a specific brand of car. A specific mechanic may always use a specific oil brand and type for a specific car, but that doesnt mean all mechanics will or perhaps even can obtain that specific oil brand or type. If that individual business chooses to as a policy then you as a customer can start to form an assumption that that is what you will get (with that assumption then failing when they change their policy).
As far as disassembly you can statically disassemble your project at build time or whenever. If loaded at a different position then there will be an offset to what you are seeing, but the .text code will still be in the same place relative to other code in that segment. If the static disassembly shows a call being 0x104 bytes ahead, then even if loaded somewhere else you should see that relative jump also be 0x104 bytes ahead, the addresses may be different.
Then there is the debugger part of this, for the debugger to work/show the correct information it also has to be part of the toolchain/loader(/os) team for everything to work/look right. It has to know this was position independent and have to know where it was loaded and/or the debugger is doing the loading for you and may not use the standard OS loader in the same way that a command line or gui does. So you might still see the binary in the same place every time when using the debugger.
The main bug here was your expectation. First operating systems like windows, linux, etc desire to use an MMU to allow them to manage memory better. To pick some/many non-linear blocks of physical memory and create a linear area of virtual memory for your program to live, more importantly the virtual address space for each separate program can look the same, I can have every program load at 0x8000 in virtual address space, without interfering with each other, with an MMU designed for this and an operating system that takes advantage of this. Even with this MMU and operating system and position independent loading one would hope they are not using physical addresses, they are still creating a virtual address space, just possibly with different load points for each program or each instance of a program. Expecting all operating systems to do this all the time is an expectation problem. And when using a debugger you are not in a stock environment, the program runs differently, can be loaded differently, etc. It is not the same as running without the debugger, so using a debugger also changes what you should expect to see happen. Two levels of expectation here to deal with.
Use an external component in a very simple program as I made above, see in the disassembly of the object that it has built for position independence as well as in the linking then try Linux as Peter has indicated and see if it loads in a different place each time, if not then you need to be looking at superuser SE or google around about how to use linux (and/or gdb) to get it to change the load location.

Sorting ARM Assembly

I am newbie. I have difficulties with understanding memory ARM memory map.
I have found example of simple sorting algorithm
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortc
; r0 = &arr[0]
; r1 = length
__sortc
stmfd sp!, {r2-r9, lr}
mov r4, r1 ; inner loop counter
mov r3, r4
sub r1, r1, #1
mov r9, r1 ; outer loop counter
outer_loop
mov r5, r0
mov r4, r3
inner_loop
ldr r6, [r5], #4
ldr r7, [r5]
cmp r7, r6
; swap without swp
strls r6, [r5]
strls r7, [r5, #-4]
subs r4, r4, #1
bne inner_loop
subs r9, r9, #1
bne outer_loop
ldmfd sp!, {r2-r9, pc}^
END
And this assembly should be called this way from C code
#define MAX_ELEMENTS 10
extern void __sortc(int *, int);
int main()
{
int arr[MAX_ELEMENTS] = {5, 4, 1, 3, 2, 12, 55, 64, 77, 10};
__sortc(arr, MAX_ELEMENTS);
return 0;
}
As far as I understand this code creates array of integers on the stack and calls _sortc function which implemented in assembly. This function takes this values from the stack and sorts them and put back on the stack. Am I right ?
I wonder how can I implement this example using only assembly.
For example defining array of integers
DCD 3, 7, 2, 8, 5, 7, 2, 6
BTW Where DCD declared variables are stored in the memory ??
How can I operate with values declared in this way ? Please explain how can I implement this using assembly only without any C code, even without stack, just with raw data.
I am writing for ARM7TDMI architecture
AREA ARM, CODE, READONLY - this marks start of section for code in the source.
With similar AREA myData, DATA, READWRITE you can start section where it's possible to define data like data1 DCD 1,2,3, this will compile as three words with values 1, 2, 3 in consecutive bytes, with label data1 pointing to the first byte of first word. (some AREA docs from google).
Where these will land in physical memory after loading executable depends on how the executable is linked (linker is using a script file which is helping him to decide which AREA to put where, and how to create symbol table for dynamic relocation done by the executable loader, by editing the linker script you can adjust where the code and data land, but normally you don't need to do that).
Also the linker script and assembler directives can affect size of available stack, and where it is mapped in physical memory.
So for your particular platform: google for memory mappings on web and check the linker script (for start just use linker option to produce .map file to see where the code and data are targeted to land).
So you can either declare that array in some data area, then to work with it, you load symbol data1 into register ("load address of data1"), and use that to fetch memory content from that address.
Or you can first put all the numbers into the stack (which is set probably to something reasonable by the OS loader of your executable), and operate in the code with the stack pointer to access the numbers in it.
You can even DCD some values into CODE area, so those words will end between the instructions in memory mapped as read-only by executable loader. You can read those data, but writing to them will likely cause crash. And of course you shouldn't execute them as instructions by accident (forgetting to put some ret/jump instruction ahead of DCD).
without stack
Well, this one is tricky, you have to be careful to not use any call/etc. and to have interrupts disabled, etc.. basically any thing what needs stack.
When people code a bootloader, usually they set up some temporary stack ASAP in first few instructions, so they can use basic stack functionality before setting up whole environment properly, or loading OS. A space for that temporary stack is often reserved somewhere in/after the code, or an unused memory space according to defined machine state after reset.
If you are down to the metal, without OS, usually all memory is writeable after reset, so you can then intermix code and data as you wish (just jumping around the data, not executing them by accident), without using AREA definitions.
But you should make your mind, whether you are creating application in user space of some OS (so you have things like stack and data areas well defined and you can use them for your convenience), or you are creating boot loader code which has to set it all up for itself (more difficult, so I would suggest at first going into user land of some OS, having C wrapper around with clib initialized is often handy too, so you can call things like printf from ASM for convenient output).
How can I operate with values declared in this way
It doesn't matter in machine code, which way the values were declared. All that matters is, if you have address of the memory, and if you know the structure, how the data are stored there. Then you can work with them in any way you want, using any instruction you want. So body of that asm example will not change, if you allocate the data in ASM, you will just pass the pointer as argument to it, like the C does.
edit: some example done blindly without testing, may need further syntax fixing to work for OP (or maybe there's even some bug and it will not work at all, let me know in comments if it did):
AREA myData, DATA, READWRITE
SortArray
DCD 5, 4, 1, 3, 2, 12, 55, 64, 77, 10
SortArrayEnd
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortasmarray
__sortasmarray
; if "add r0, pc, #SortArray" fails (code too far in memory from array)
; then this looks like some heavy weight way of loading any address
; ldr r0, =SortArray
; ldr r1, =SortArrayEnd
add r0, pc, #SortArray ; address of array
; calculate array size from address of end
; (as I couldn't find now example of thing like "equ $-SortArray")
add r1, pc, #SortArrayEnd
sub r1, r1, r0
mov r1, r1, lsr #2
; do a direct jump instead of "bl", so __sortc returning
; to lr will actually return to called of this
b __sortc
; ... rest of your __sortc assembly without change
You can call it from C code as:
extern void __sortasmarray();
int main()
{
__sortasmarray();
return 0;
}
I used among others this Introducing ARM assembly language to refresh my ARM asm memory, but I'm still worried this may not work as is.
As you can see, I didn't change any thing in the __sortc. Because there's no difference in accessing stack memory, or "dcd" memory, it's the same computer memory. Once you have the address to particular word, you can ldr/str it's value with that address. The __sortc receives address of first word in array to sort in both cases, from there on it's just memory for it, without any context how that memory was defined in source, allocated, initialized, etc. As long as it's writeable, it's fine for __sortc.
So the only "dcd" related thing from me is loading array address, and the quick search for ARM examples shows it may be done in several ways, this add rX, pc, #label way is optimal, but does work only for +-4k range? There's also pseudo instruction ADR rX, #label doing this same thing, and maybe switching to other in case of range problem? For any range it looks like ldr rX, = label form is used, although I'm not sure if it's pseudo instruction or how it works, check some tutorials and disassembly the machine code to see how it was compiled.
It's up to you to learn all the ARM assembly peculiarities and how to load addresses of arrays, I don't need ARM ASM at the moment, so I didn't dig into those details.
And there should be some equ way to define length of array, instead of calculating it in code from end address, but I couldn't find any example, and I'm not going to read full Assembler docs to learn about all it's directives (in gas I think ArrayLength equ ((.-SortArray)/4) would work).

Memory Mapping in Microcotroller

1. #define timers ((dual_timers *)0x03FF6000)
This is a memory map definition used in an ARM Microcontroller
where the structure definition is
2. struct dual_timers
{
special_register TMOD;
special_register TDATA0;
special_register TDATA1;
special_register TCNT0;
special_register TCNT1;
};
What the meaning of(dual_timers *)0x03FF6000) ?, is it type casting .
if it is typecasting please explain its influence in the code.
How would the compiler see the definition 'timers' after this?
This has been asked and answered countless times here.
First off the structure thing is a bad idea, not portable not reliable, even though it is used as often as it isnt in vendors code. Little time bombs waiting to go off and have you pay them for support perhaps.
Your define is just elementary C. It is a typecast, I have this address happens to be hardcoded, in C programming class we might have used the name of some other pointer and likely not the define
unsigned int *bob;
unsigned char *ted = (unsigned char *)bob;
(yet another programming trick you should never use). And you can spin that around as a define
#define ted (unsigned char *)bob
Or something to that effect. bob is just an address with a human readable name.
For this to work you need a volatile in there (which it isnt?) and they have yet another typedef somewhere that defines dual_timers so they dont have to keep typing volatile unsigned int or volatile uint32_t or volatile uint8_t or whatever size the registers are. The volatile is because you know but the compiler doesnt that you are pointing at hardware not ram, you need the compiler to perform all of the loads and stores and not optimize any out.
In addition you need the compiler to perform the right sized loads and stores, if it is a register that can only be accessed with 32 bit wide transactions, you need the compiler to implement this with the right instructions. And no matter what you do that is not a guarantee, this programming style can and if you are unlucky will fail for you. It is a very wide spread practice, but it is not foolproof. It and even worse than making pointers to absolute addresses is using structures across a compile domain, hardware is a separate compile domain from your code. You cannot guarantee no matter how many compiler specific directives you find, that that code will remain working as time goes on and compilers are upgraded or if god forbid you try to compile on some other computer. It may work 99.9999% of the time but that time that it fails is a massive failure that earthquake once in a zillion years that wipes out all of Tokyo. As you see in kernel drivers using an abstraction makes for portable code, in bare metal you can implement that abstraction in assembly language and guarantee the correct instruction is used. It can cost you some cycles, so you can create a define/typedef just like the one you are asking about for the abstraction, but your code is not forced into that and a complete re-write of your code is not required if you need to port that code or work around a chip errata, etc. the latter is my personal opinion and style based on decades of experience in bare metal programming.
The define is just an elementary C typedef nothing special or fancy just read it like any other C syntax to understand what it is doing. The struct is a way of applying offsets to that address, so if we assume that all of these registers are 32 bit then the "desire" is to have accesses to TMOD be at address 0x03FF6000+0x00, accesses to TDATA0 be at address 0x03FF6000+0x04, TDATA1 0x03FF6000+0x08 and so no. But again there is nothing here that insures that is actually going to happen nor does it insure that 32 bit loads or stores are used. A simple disassembly of the code will show these addresses being generated for these accesses.
I assume you tried using code like this to see what it did:
typedef volatile unsigned int special_register;
typedef struct
{
special_register TMOD;
special_register TDATA0;
special_register TDATA1;
special_register TCNT0;
special_register TCNT1;
} dual_timers;
#define timers ((dual_timers *)0x03FF6000)
unsigned int fun ( void )
{
timers->TMOD=5;
timers->TDATA0|=1;
timers->TCNT0=timers->TCNT1;
return(timers->TDATA1);
}
for arm as you mentioned producing
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f301c ldr r3, [pc, #28] ; 28 <fun+0x28>
8: e5832000 str r2, [r3]
c: e5932004 ldr r2, [r3, #4]
10: e3822001 orr r2, r2, #1
14: e5832004 str r2, [r3, #4]
18: e5932010 ldr r2, [r3, #16]
1c: e583200c str r2, [r3, #12]
20: e5930008 ldr r0, [r3, #8]
24: e12fff1e bx lr
28: 03ff6000 mvnseq r6, #0
Yes it is type casting. It basically says that starting from address 0x03FF6000 you can consider that there is a dual_timers structure.
In this context, I guess that special_register is defined as something like volatile unsigned uint32_t.
This is a typical way of easily accessing the registers of a microncontroller. For accessing the register TDATA0 for example, in your code you will need to use timers->TDATA0
It means that there is a pointer to the structure dual_timers and the value of the pointer is 0x03FF6000, i.e. it is pointing to the structure located at 0x03FF6000.
The compiler (in fact preprocessor) sees the expression (dual_timers *)0x03FF6000) every time it looks at the word timers. For you it looks like timers->TDATA0 but for the compiler it looks like (dual_timers *)0x03FF6000)->TDATA0, take TDATA0 field of dual_timers structure located at 0x03FF6000.

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (a older Cortex-M3) with the SysTick interrupt to trigger at 10Hz. This is the body of the ISR:
void SysTick_Handler(void)
{
__asm__ volatile("sub r4, r4, #32\r\n");
}
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) Is the exception exit procedure is equally confusing. From my limited understanding of the ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
Followed by storing it back to SP. Which seems to undo the manipulations until this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?
The ARM Procedure Calling Standard and C ABI expect an 8 byte (64 bit) alignment of the stack. As an interrupt might occur after pushing/poping a single word, it is not guaranteed the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default) enforces the hardware to align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function tells gcc, OTOH the stack might be missaligned, so it adds this pre-/postamble which enforces the alignment.
So, both actually do the same; one in hardware, one in software. If you can live with a word-aligned stack only, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "missaligned" stack is no problem (I would not expect any, as this is a pure 32 bit CPU). OTOH, you should leave it as-is, unless you really need to safe that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!
Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes this is doing 8byte alignment. Its also allocating a stack frame to hold local variables even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as its unneeded) and the code will become much simpler.

Resources