Programming STM32F4x WITHOUT IDE on Debian - c

This is my first question on this website, and I'm not sure about my English.
I want to know if there is a way to program the Nucleo STM32F446RE (via USB, not via JTAG) WITHOUT using any IDE.
For training purposes, I want to program with only a text editor (I use Kate), Makefiles and the command line.
What I already found/installed:
gcc-arm-none-eabi (6-2017-q2-update)
It contains, I think, everything needed to compile (but I don't think there is an assembler in there).
It ships with example C code and Makefiles (that I don't totally understand). It seems to compile well (I tried the "minimum" example).
Here is the example I used:
#ifndef __NO_SYSTEM_INIT
void SystemInit()
{}
#endif
void main()
{
for (;;);
}
And here is the Makefile:
# Selecting Core
CORTEX_M=4
# Use newlib-nano. To disable it, specify USE_NANO=
USE_NANO=--specs=nano.specs
# Use semihosting or not
USE_SEMIHOST=--specs=rdimon.specs
USE_NOHOST=--specs=nosys.specs
CORE=CM$(CORTEX_M)
BASE=../..
# Compiler & Linker
CC=arm-none-eabi-gcc
CXX=arm-none-eabi-g++
# Options for specific architecture
ARCH_FLAGS=-mthumb -mcpu=cortex-m$(CORTEX_M)
# Startup code
STARTUP=$(BASE)/startup/startup_ARM$(CORE).S
# -Os -flto -ffunction-sections -fdata-sections to compile for code size
CFLAGS=$(ARCH_FLAGS) $(STARTUP_DEFS) -Os -flto -ffunction-sections -fdata-sections
CXXFLAGS=$(CFLAGS)
# Link for code size
GC=-Wl,--gc-sections
# Create map file
MAP=-Wl,-Map=$(NAME).map
NAME=minimum
STARTUP_DEFS=-D__STARTUP_CLEAR_BSS -D__START=main
LDSCRIPTS=-L. -L$(BASE)/ldscripts -T nokeep.ld
LFLAGS=$(USE_NANO) $(USE_NOHOST) $(LDSCRIPTS) $(GC) $(MAP)
$(NAME)-$(CORE).axf: $(NAME).c $(STARTUP)
	$(CC) $^ $(CFLAGS) $(LFLAGS) -o $@
clean:
	rm -f $(NAME)*.axf *.map *.o
I modified it in order to set cortex-m4 instead of cortex-m0.
After running the make command I get minimum.map and minimum.axf files.
But I don't know how to load the object code onto the device. (And is it normal not to have a minimum.o file?)

I would call something like this a minimal example with C code. The infinite loop is not necessary in this case, but it is inspired by yours.
vectors.s
.thumb
.globl _start
_start:
.word 0x20002000
.word reset
.word done
.word done
.thumb_func
reset:
bl centry
b done
.thumb_func
done:
b done
so.c
void centry ( void )
{
for(;;) continue;
}
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
build
arm-none-eabi-as vectors.s -o vectors.o
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-ld -T flash.ld vectors.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
examine
08000000 <_start>:
8000000: 20002000 andcs r2, r0, r0
8000004: 08000011 stmdaeq r0, {r0, r4}
8000008: 08000017 stmdaeq r0, {r0, r1, r2, r4}
800000c: 08000017 stmdaeq r0, {r0, r1, r2, r4}
08000010 <reset>:
8000010: f000 f802 bl 8000018 <centry>
8000014: e7ff b.n 8000016 <done>
08000016 <done>:
8000016: e7fe b.n 8000016 <done>
08000018 <centry>:
8000018: e7fe b.n 8000018 <centry>
800001a: 46c0 nop ; (mov r8, r8)
Likely not required, but read the docs: folks link at 0x08000000, while technically the core boots from 0x00000000. The STM32 family maps 0x08000000 to 0x00000000 based on the boot pins, as described in the documentation. Inspect the listing to confirm that the vector table is the first thing in the image and that the vector entries have the lsbit set; .thumb_func is what told the toolchain these are Thumb function addresses. I could have put the C entry function directly in the vector table as the reset handler (a function named main() is not required, that is just a convention). There is no .data or .bss initialization here, so with this bootstrap you cannot use initialized .data variables or assume that .bss variables are zero; you have to write before you read. Adding more code to the bootstrap (and the linker script) would allow for that.
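As a sketch of that inspection step, the check below could be run on a host PC against the first bytes of the image. The function name, the error codes and the sizes passed in are my own assumptions for illustration, not part of any tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one little-endian 32-bit word, as stored in a Cortex-M image. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: returns 0 if the first two vector table entries
 * look plausible, otherwise a small error code.
 */
int check_vector_table(const uint8_t *image, size_t len,
                       uint32_t flash_base, uint32_t flash_size,
                       uint32_t ram_base, uint32_t ram_size)
{
    uint32_t sp, reset, target;

    if (len < 8)
        return 1;                     /* too short to hold two vectors */
    sp = read_le32(image);            /* word 0: initial stack pointer */
    reset = read_le32(image + 4);     /* word 1: reset handler address */

    /* The stack grows down, so the address just past RAM is legal. */
    if (sp < ram_base || sp - ram_base > ram_size)
        return 2;
    /* Thumb function addresses must have the lsbit set. */
    if ((reset & 1u) == 0)
        return 3;
    target = reset & ~1u;
    if (target < flash_base || target - flash_base >= flash_size)
        return 4;
    return 0;
}
```

For the listing above, the first two words are 0x20002000 and 0x08000011. Note that 0x20002000 lies beyond the 0x1000-byte ram region declared in flash.ld, so the check only passes if you hand it the real SRAM size of the part (128 KB on the STM32F446RE as far as I know; confirm in the datasheet).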
arm-none-eabi-objcopy so.elf -O binary so.bin
This creates a binary which, depending on the tools you use, can be loaded onto the device. If this is a Nucleo board you can copy that file to the virtual USB drive. Clearly this program won't show anything interesting. Using OpenOCD or other SWD debugger software (with a Nucleo board you don't need any other hardware) you can stop and restart the program to see that it is running.
You can read the documentation to see the addresses and how to program the peripherals.
Thumb-2 is just a set of extensions to Thumb. You can stick with traditional Thumb, or add cortex-m4 or armv7-m to the command line (-mcpu/-march) to try to reduce the number of instructions, at the cost of some larger instructions.
There are no doubt tools out there, but it is fairly easy to write your own program to talk to the serial bootloader and download your program into the device.
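To give a feel for how little is involved, here is a sketch of the framing used by the ST USART bootloader as I understand it from ST application note AN3155: a command byte is followed by its complement, an address is sent MSB first followed by the XOR of its four bytes, and a data block is sent as N-1, the N data bytes, then the XOR of all of those. The function names are mine, and you should check the byte values and any alignment rules against the application note for your part before relying on this.

```c
#include <stddef.h>
#include <stdint.h>

/* Values as I read them in AN3155; verify for your device. */
#define BL_SYNC             0x7Fu  /* sent once for baud rate detection */
#define BL_ACK              0x79u
#define BL_NACK             0x1Fu
#define BL_CMD_WRITE_MEMORY 0x31u

/* A command is the command byte followed by its bitwise complement. */
void bl_command_frame(uint8_t cmd, uint8_t out[2])
{
    out[0] = cmd;
    out[1] = (uint8_t)~cmd;
}

/* An address is four bytes, MSB first, followed by their XOR. */
void bl_address_frame(uint32_t addr, uint8_t out[5])
{
    out[0] = (uint8_t)(addr >> 24);
    out[1] = (uint8_t)(addr >> 16);
    out[2] = (uint8_t)(addr >> 8);
    out[3] = (uint8_t)addr;
    out[4] = (uint8_t)(out[0] ^ out[1] ^ out[2] ^ out[3]);
}

/*
 * A data block is N-1, then N data bytes, then the XOR of all of them.
 * Returns the frame length, or 0 if n is out of range or out is too small.
 */
size_t bl_data_frame(const uint8_t *data, size_t n,
                     uint8_t *out, size_t out_cap)
{
    uint8_t x;
    size_t i;

    if (n == 0 || n > 256 || out_cap < n + 2)
        return 0;
    x = (uint8_t)(n - 1);
    out[0] = x;
    for (i = 0; i < n; i++) {
        out[1 + i] = data[i];
        x ^= data[i];
    }
    out[n + 1] = x;
    return n + 2;
}
```

A host program would send each frame over the serial port and wait for BL_ACK before sending the next one. The serial port handling itself is left out here because it depends on your operating system.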

Related

qemu-arm with Cortex-M4 on Linux

I am using qemu-arm and the ARM Workbench IDE to run/profile an ARM binary which was built with armcc/armlink (an .axf file, program written in C). This works fine with Cortex-A9 and ARM926/ARM5TE. However, whatever I try, it doesn't work when the binary is built for Cortex-M4. Both the simulator and qemu-arm hang when M4 is selected as the CPU.
I know that this processor requires some additional startup code, but I could not find any comprehensive tutorial on how to get it running. Does anyone know how to do this? I have quite a big project with one main function, but it would already help if a "hello world" or some simple program which takes arguments would run.
Here is the command line I am using with Cortex-A9:
qemu-system-arm -machine versatileab -cpu cortex-a9 -nographic -monitor null -semihosting -append 'some program arguments' -kernel program.axf
I do not know how to do it with the versatilepb, it did not "just work", but this does work:
flash.s
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
notmain.c
void PUT32 ( unsigned int, unsigned int );
#define UART0BASE 0x4000C000
int notmain ( void )
{
unsigned int rx;
for(rx=0;rx<8;rx++)
{
PUT32(UART0BASE+0x00,0x30+(rx&7));
}
return(0);
}
flash.ld
ENTRY(_start)
MEMORY
{
rom : ORIGIN = 0x00000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
(I am told that the entry point being a Thumb function address is critical; YMMV.)
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c notmain.c -o notmain.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy -O binary notmain.elf notmain.bin
check the vector table, etc.
00000000 <_start>:
0: 20001000
4: 0000000d
8: 00000013
0000000c <reset>:
c: f000 f804 bl 18 <notmain>
10: e7ff b.n 12 <hang>
00000012 <hang>:
12: e7fe b.n 12 <hang>
Looks good.
And run it
qemu-system-arm -M lm3s811evb -m 8K -nographic -kernel notmain.bin
01234567
Then ctrl-a then x to exit
QEMU: Terminated
-cpu cortex-m4 works as well as one would expect. You would have to find things that differ between the M3 and the M4 and that might show up in a simulator like this, and go from there.
After Luminary Micro (acquired by TI a while ago now), I do not think anyone else put the effort in for a machine model. But as already discussed in at least one question on this site, you can run the cores (an exercise for the reader).
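PUT32 above is written in assembly. If you would rather stay in C, a volatile store is the usual alternative; the sketch below is my own rewrite of the loop from notmain.c under that assumption, with function names I made up. It takes the register address as a parameter so the same code can be exercised on a host PC against an ordinary variable.

```c
#include <stdint.h>

/* C stand-in for the assembly PUT32: one 32-bit store the compiler may not drop. */
static inline void put32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Write the characters '0'..'7' (repeating) to a UART data register. */
void uart_put_digits(volatile uint32_t *uart_data_reg, unsigned int count)
{
    unsigned int i;

    for (i = 0; i < count; i++)
        put32(uart_data_reg, 0x30u + (i & 7u));
}
```

On the lm3s811evb model you would call it as uart_put_digits((volatile uint32_t *)0x4000C000u, 8), using the UART0 base address from notmain.c.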
For versatilepb
int notmain ( void )
{
unsigned int ra;
for(ra=0;;ra++)
{
ra&=7;
PUT32(0x101f1000,0x30+ra);
}
return(0);
}
qemu-system-arm -machine versatileab -cpu cortex-m4 -nographic -monitor null -kernel notmain.elf
qemu-system-arm: This board cannot be used with Cortex-M CPUs
You can't arbitrarily plug different CPU types into an Arm board model. If you try it then the resulting system may work by luck, or may crash, or have odd behaviour; in some cases the -cpu option will just be ignored. This is because the CPU integration with the board matters: things like interrupt controllers are part of the board, not the CPU, but not all CPUs will work with all interrupt controllers. Often QEMU is not as good as it could be about detecting and reporting errors for user options that aren't valid.
In this case you're probably using an older QEMU: newer ones will correctly report:
qemu-system-arm: This board cannot be used with Cortex-M CPUs
if you try to use '-machine versatilepb' with '-cpu cortex-m4'. Older ones would either crash or just misbehave.
Generally the best thing is to use the CPU type that the board has by default (ie don't specify a -cpu option), for every board type except the "virt" board. If you want to write code for a Cortex-M4, you should look for a board type that has a Cortex-M4. The mps2-an386 is probably a good option. (If your QEMU doesn't have that board type, upgrade to a newer one: there have been a lot of M-profile emulation bug fixes anyway that you'll want to have.)

How to use the enhanced multiplier instructions of ARMv5TE instruction set

I'm using an ARM966E-S RISC-CPU and was wondering how to use the apparently available instruction set extensions for better DSP performance, e. g. an enhanced multiplier instruction.
I've read in the technical reference manual that these instruction set extensions are available but I don't know how to use/activate them.
Can anybody help?
Thanks in advance!
Why not just try it? Or read the manual for your toolchain. For example, with the GNU tools:
so.s
ldrd r0,[r2]
ldr r2,[r2]
test
arm-none-eabi-as so.s -o so.o
arm-none-eabi-as -march=armv5t so.s -o so.o
so.s: Assembler messages:
so.s:3: Error: selected processor does not support `ldrd r0,[r2]' in ARM mode
arm-none-eabi-as -march=armv5te so.s -o so.o
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <.text>:
0: e1c200d0 ldrd r0, [r2]
4: e5922000 ldr r2, [r2]

Where is __aeabi_fmul being linked from?

I've been running code on the ARM M0+ core and I see that the vast majority of my time is spent in floating point calculations. So I am experimenting with a custom floating point calculation function for use in very low power applications.
I've been using ARM GCC for bare metal compile on an M0+ (without a hard FPU). I see that floating point multiplication gets linked to __aeabi_fmul and then linked to generate the final ELF file.
My questions are as follows:
Where is __aeabi_fmul defined? Is it in a pre-compiled library that comes with GCC?
Is it possible to change this definition in some way? Maybe have a pre-compiled version of my_fp_mul instead and link to that instead of __aeabi_fmul?
I understand that the second part needs me to mess with the compiler. I've been looking into Clang/LLVM to do this, since the general consensus seems to be that it's easier to modify than GCC. I'm just trying to see if this is even possible, or if I'm barking up entirely the wrong tree here.
Thank you.
It is part of GCC, in the GCC runtime library (libgcc). Download the GCC sources, search for those functions, and you will find them. They are soft-float routines and are hand tuned, so you are unlikely to do a significantly better job, but knock yourself out. I am not sure why you would do any floating point on an MCU like that, but thankfully the language and the tools allow it, although it can consume a lot of flash and execution time. (Using no float variables and doing the math yourself in fixed point is a possible compromise.)
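As a sketch of the fixed point compromise, here is a signed Q8.8 multiply: 8 integer bits, 8 fraction bits, so the value 1.0 is stored as 256. The type and function names are my own. I picked a 16-bit format so that the product fits in 32 bits, which I would expect to map onto the core's 32-bit multiply rather than a library call, but I have not checked the generated code.

```c
#include <stdint.h>

/* Signed Q8.8: real value = raw / 256. Range is roughly -128.0 to +127.996. */
typedef int16_t q8_8;

/* Convert a small integer to Q8.8. The caller must keep v within -128..127. */
q8_8 q8_8_from_int(int v)
{
    return (q8_8)(v * 256);
}

/*
 * Multiply two Q8.8 values. The 32-bit product carries 16 fraction bits,
 * so divide by 256 to get back to 8. Division truncates toward zero.
 * There is no overflow check: the caller must keep the result in range.
 */
q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    int32_t product = (int32_t)a * (int32_t)b;

    return (q8_8)(product / 256);
}
```

For example 1.5 is stored as 384 and 2.0 as 512; their product is 768, which is 3.0. Whether 8 fraction bits are enough depends entirely on your application.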
If you use gcc to link, gcc knows where the libraries are and will pull them in automatically. If you use ld to link (using gcc just as a compiler, not as the driver for everything in the toolchain), ld does not know where to find the libraries, and you can simply add your own object on the command line. This is the simplest way.
You can take the as-is GNU source for a particular function, add it to your project, and then modify it, or just completely replace it with your own function.
Naturally you can go into the compiler sources, rename things, and rebuild the compiler. I am not sure how much work you want to do here; replacing the floating point routines without mistakes is already a large task. As mentioned in the comments, I would leave the compiler alone and just work with it (leave the names the same and link with ld).
start.s
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.word hang
.thumb_func
reset:
bl notmain
.thumb_func
hang: b .
so.c
float notmain ( float a, float b )
{
return(a+b);
}
memmap
MEMORY
{
rom : ORIGIN = 0x00000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > rom
}
build
arm-none-eabi-as start.s -o start.o
arm-none-eabi-gcc -Xlinker -T -Xlinker memmap -nostdlib -nostartfiles -ffreestanding -mthumb start.o so.c -o so.elf -lgcc
arm-none-eabi-objdump -D so.elf
It doesn't complain, but it makes a perfectly broken binary:
20000048 <__addsf3>:
20000048: e1b02080 lsls r2, r0, #1
2000004c: 11b03081 lslsne r3, r1, #1
20000050: 11320003 teqne r2, r3
20000054: 11f0cc42 mvnsne r12, r2, asr #24
20000058: 11f0cc43 mvnsne r12, r3, asr #24
2000005c: 0a000047 beq 20000180 <__addsf3+0x138>
20000060: e1a02c22 lsr r2, r2, #24
20000064: e0723c23 rsbs r3, r2, r3, lsr #24
20000068: c0822003 addgt r2, r2, r3
2000006c: c0201001 eorgt r1, r0, r1
20000070: c0210000 eorgt r0, r1, r0
Those are ARM instructions, not Thumb. Examining what the linker was passed:
0:[/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/../../../../arm-none-eabi/bin/ld]
1:[-plugin]
2:[/opt/gnuarm/libexec/gcc/arm-none-eabi/7.1.0/liblto_plugin.so]
3:[-plugin-opt=/opt/gnuarm/libexec/gcc/arm-none-eabi/7.1.0/lto-wrapper]
4:[-plugin-opt=-fresolution=/tmp/ccSyISCJ.res]
5:[-X]
6:[-o]
7:[so.elf]
8:[-L/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/thumb]
9:[-L/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0]
10:[-L/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/../../../../arm-none-eabi/lib]
11:[-T]
12:[memmap]
13:[start.o]
14:[/tmp/ccrdRU2s.o]
15:[-lgcc]
The other approach:
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-ld -T memmap start.o so.o /opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/thumb/libgcc.a -o so.elf
But this is still broken:
20000038 <__addsf3>:
20000038: e1b02080 lsls r2, r0, #1
2000003c: 11b03081 lslsne r3, r1, #1
20000040: 11320003 teqne r2, r3
20000044: 11f0cc42 mvnsne r12, r2, asr #24
20000048: 11f0cc43 mvnsne r12, r3, asr #24
2000004c: 0a000047 beq 20000170 <__addsf3+0x138>
20000050: e1a02c22 lsr r2, r2, #24
20000054: e0723c23 rsbs r3, r2, r3, lsr #24
I have not done the things I need to do to get the right library; I have to run and will re-edit this later.
But my proposed solution is:
.thumb_func
.globl __aeabi_fadd
__aeabi_fadd:
bx lr
I added this to start.s for demonstration purposes:
arm-none-eabi-as start.s -o start.o
arm-none-eabi-ld -T memmap start.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf
Disassembly of section .text:
20000000 <_start>:
20000000: 20001000 andcs r1, r0, r0
20000004: 20000015 andcs r0, r0, r5, lsl r0
20000008: 20000019 andcs r0, r0, r9, lsl r0
2000000c: 20000019 andcs r0, r0, r9, lsl r0
20000010: 20000019 andcs r0, r0, r9, lsl r0
20000014 <reset>:
20000014: f000 f802 bl 2000001c <notmain>
20000018 <hang>:
20000018: e7fe b.n 20000018 <hang>
2000001a <__aeabi_fadd>:
2000001a: 4770 bx lr
2000001c <notmain>:
2000001c: b510 push {r4, lr}
2000001e: f7ff fffc bl 2000001a <__aeabi_fadd>
20000022: bc10 pop {r4}
20000024: bc02 pop {r1}
20000026: 4708 bx r1
Then fill in whatever you want. Clearly this is not a real program: it breaks many rules, no numbers are being passed in, etc.
But the compiler generated a call to __aeabi_fadd, I supplied an __aeabi_fadd, and the linker was happy.
What I have done in the past, since I build my own GNU toolchain anyway, is go in and put a syntax error in the file of interest and do the build. When it fails, the long command line used to build that item is on the screen. Isolate the function of interest, use that long gcc command line as a guide, and tweak and tune as desired. That gets you there faster than trying to figure out all the defines in the code on your own.

Is it possible to create a basic bare-metal Assembly bootup/startup program using only GNU LD command-line options

Is it possible to create a basic bare-metal Assembly bootup/startup program using only GNU LD command-line options in lieu of a customary -T scriptfile for a Cortex-M4 target?
I have reviewed the GNU LD documentation and searched various locations including this site; however, I have not found any information suggesting that the exclusive use of command-line options for the GNU linker is possible or not possible.
My attempt to manage the object file layout without a customary vendor-provided *.ld script file is purely academic. This is not homework. I'm not requesting any help with writing the startup assembly code. I'm merely looking for a definitive answer or a pointer to further resources.
$ arm-none-eabi-ld bootup.o -o bootup @bootup.ld.cli.file
Sample bootup.ld.cli.file content
--entry 0x0
--Ttext=0x0
--section-start .isr_vector=0x0
--section-start _start=0x4
--section-start .MyCode=0x8c
--Tdata=0x20000000
--Tbss=0x20000000
-M=bootup.map
--print-gc-sections
You have your answer right there: -Ttext=number, -Tdata=number and so on are not GNU linker script items, they are GNU ld command-line options. Note the at sign on your command line, which makes ld read its options from that file.
A GNU linker script looks more like this (although most are significantly more complicated, even if they don't need to be).
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
Note that the GNU linker is a bit funny when you use the -Ttext=address approach. Sometimes it will insert gaps: you might have a few kilobytes of program, and instead of placing it linearly at the address like it should, it will put some, then pad some dead space, then put some more. I never figured out why. For extremely limited targets, the linker script (versus the command line), all other factors held constant, does not put the gap in the output.
EDIT:
so.s
.cpu cortex-m0
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.word hang
.word hang
.thumb_func
reset:
b hang
.thumb_func
hang: b .
flash.s
.cpu cortex-m0
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.word hang
.word hang
.word hang
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.thumb_func
.globl dummy
dummy:
bx lr
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
blinker02.c
void dummy ( unsigned int );
int notmain ( void )
{
unsigned int ra;
for(ra=0;ra<100;ra++) dummy(ra);
return(0);
}
Makefile
ARMGNU = arm-none-eabi
AOPS = --warn -mcpu=cortex-m0
COPS = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=cortex-m0
all : blinker02.bin sols.bin socl.bin
clean:
	rm -f *.bin
	rm -f *.o
	rm -f *.elf
	rm -f *.list
so.o : so.s
	$(ARMGNU)-as $(AOPS) so.s -o so.o
flash.o : flash.s
	$(ARMGNU)-as $(AOPS) flash.s -o flash.o
blinker02.o : blinker02.c
	$(ARMGNU)-gcc $(COPS) -mthumb -c blinker02.c -o blinker02.o
blinker02.bin : flash.ld flash.o blinker02.o
	$(ARMGNU)-ld -o blinker02.elf -T flash.ld flash.o blinker02.o
	$(ARMGNU)-objdump -D blinker02.elf > blinker02.list
	$(ARMGNU)-objcopy blinker02.elf blinker02.bin -O binary
sols.bin : so.o
	$(ARMGNU)-ld -o sols.elf -T flash.ld so.o
	$(ARMGNU)-objdump -D sols.elf > sols.list
	$(ARMGNU)-objcopy sols.elf sols.bin -O binary
socl.bin : so.o
	$(ARMGNU)-ld -o socl.elf -Ttext=0x08000000 -Tbss=0x20000000 so.o
	$(ARMGNU)-objdump -D socl.elf > socl.list
	$(ARMGNU)-objcopy socl.elf socl.bin -O binary
The only difference between the command line (socl) and the linker script (sols) list files is the file name:
diff sols.list socl.list
2c2
< sols.elf: file format elf32-littlearm
---
> socl.elf: file format elf32-littlearm
I am not going to bother demonstrating the difference you may see down the road.
For assembly only you don't need to worry about -nostartfiles and the other gcc command-line options. With C objects you do. By not allowing the linker to use the toolchain's as-built/configured (or let's say C library) bootstrap code, you have to provide one. If you don't complicate the linker script to the point that specific object files are called out, then the ordering of objects on the command line matters: if you swap flash.o and blinker02.o on the ld command line in the Makefile, the binary won't work. You can set entry points all you want, but those are strictly for the loader. If this is bare metal, which it appears to be, then the entry point is useless; the hardware boots how it boots. In this case, with a Cortex-M, address zero holds the value to load into the stack pointer and address four holds the address of the reset handler (with the lsbit set, since this is a Thumb-only machine; let the tools do that for you by using the GNU assembler specific .thumb_func to indicate that the next label is a branch destination address).
I sprinkled cortex-m0 about for two reasons: one, that is what I took this code from, and two, the original Thumb from ARMv4T and ARMv5T, or as the newer ARM docs call it, "all thumb variants", is the most portable ARM instruction set across the ARM cores. With your Cortex-M4 you can get rid of that, or perhaps make it cortex-m3 or cortex-m4 to pull in the ARMv7-M Thumb-2 extensions.
So the short answer is that
arm-none-eabi-ld -o so.elf -Ttext=0x08000000 -Tbss=0x20000000 so.o
is more than adequate for making working binaries, ASSUMING you don't need a .data.
.data requires a lot more stuff: a linker script, a more complicated bootstrap, etc. That, or you do a copy-and-jump thing: compile the REAL program to run from SRAM only (a different entry point, full-sized ARM style, but at the RAM base address), then write an ad hoc tool that takes that binary and turns it into, say, .word 0xabcdef entries in a program that copies the whole REAL program from flash to RAM and then branches to it. That copy-and-jump program is flash only, with no .data or .bss really needed, and can be linked from the command line; so can the REAL RAM-only program. And I probably lost you already on that one.
Likewise, when using the command line you cannot, or should not, assume that .bss is zeroed; your bootstrap has to do that too. Now if you have .bss and no .data, then sure, you could blindly zero all of the RAM on boot before you branch to your C program's entry point. (I use notmain() both because at least one old compiler added unnecessary garbage to the binary if it saw a main() function, and to emphasize the point that normally there is nothing magic about the function named main().)
Linker scripts are toolchain specific, so there is no reason to expect GNU linker scripts to port to Keil or to ARM's own tools (yes, I know ARM owns Keil now; I was referring to RVCT or whatever it is called now), etc. So that is the first .data/.bss problem. Ideally you want your tools to do the work: they know how big .data and .bss are, so just let them tell you. How you let them tell you is by crafting the linker script right (at least with ld), and that is tricky. The script creates variables, if you will, that define things like the start address of .bss and the end address of .bss, maybe even some math to subtract them and get the length, and likewise for .data. Then in the bootstrap assembly you can zero the .bss memory using start address and length, or start address and end address. For .data you need two addresses, where it was put in flash (more linker script foo) and where it wants to go in RAM, plus the length; then the bootstrap copies.
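To make the zero-and-copy step concrete, here is a sketch of it in C rather than assembly. The function name and parameters are my own. On the target the five pointers would come from symbols that your linker script defines; whatever names you give those symbols are your choice, nothing here is standard. Written this way, the routine can also be tried on a host PC with ordinary arrays standing in for flash and RAM.

```c
#include <stdint.h>

/*
 * Hypothetical bootstrap helper, to be called before the first C function
 * that relies on initialized or zeroed globals.
 *
 *   bss_start, bss_end    : bounds of .bss in RAM (end is exclusive)
 *   data_load             : where the initial values of .data sit in flash
 *   data_start, data_end  : where .data lives in RAM (end is exclusive)
 *
 * All regions are assumed to be word aligned and a multiple of 4 bytes,
 * which the linker script would have to guarantee.
 */
void init_ram_sections(uint32_t *bss_start, uint32_t *bss_end,
                       const uint32_t *data_load,
                       uint32_t *data_start, uint32_t *data_end)
{
    uint32_t *dst;

    /* Zero .bss so that "unsigned int y;" really starts out as 0. */
    for (dst = bss_start; dst < bss_end; dst++)
        *dst = 0;

    /* Copy .data so that "unsigned int x=5;" really starts out as 5. */
    for (dst = data_start; dst < data_end; dst++)
        *dst = *data_load++;
}
```

The routine itself must not depend on any .data or .bss variables, which is one reason this job is traditionally done in the assembly bootstrap.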
So basically, if you write this code
unsigned int x=5;
unsigned int y;
and you link from the command line, with no linker script and bootstrap that handle .data and .bss, there is no reason whatsoever to expect x to be 5 or y to be 0 when the first C function that uses those variables is entered. If you assume that x will be 5, your program will fail.
If you do this instead
unsigned int x;
unsigned int y;
void myfun ( void )
{
x=5;
y=0;
}
Now those assignments are instructions in .text and not values in .data, so it will always work: command line or not, simple linker script or complicated, etc.

Query on -ffunction-sections & -fdata-sections options of gcc

The GCC documentation says the following about the function sections and data sections options:
-ffunction-sections
-fdata-sections
Place each function or data item into its own section in the output file if the target supports arbitrary sections. The name of the function or the name of the data item determines the section's name in the output file.
Use these options on systems where the linker can perform optimizations to improve locality of reference in the instruction space. Most systems using the ELF object format and SPARC processors running Solaris 2 have linkers with such optimizations. AIX may have these optimizations in the future.
Only use these options when there are significant benefits from doing so. When you specify these options, the assembler and linker will create larger object and executable files and will also be slower. You will not be able to use gprof on all systems if you specify this option and you may have problems with debugging if you specify both this option and -g.
I was under the impression that these options will help in reducing the executable file size. Why does this page say that it will create larger executable files? Am I missing something?
Interestingly, using -fdata-sections can make the literal pools of your functions, and thus your functions themselves larger. I've noticed this on ARM in particular, but it's likely to be true elsewhere. The binary I was testing only grew by a quarter of a percent, but it did grow. Looking at the disassembly of the changed functions it was clear why.
If all of the BSS (or DATA) entries in your object file are allocated to a single section, then the compiler can store the address of that section in the function's literal pool and generate loads with known offsets from that address to access your data. But if you enable -fdata-sections, it puts each piece of BSS (or DATA) data into its own section. Since it doesn't know which of these sections might be garbage collected later, or in what order the linker will place them in the final executable image, it can no longer load data using offsets from a single address. So instead it has to allocate one literal pool entry per used data item, and once the linker has figured out what is going into the final image and where, it can go and fix up these literal pool entries with the actual addresses of the data.
So yes, even with -Wl,--gc-sections the resulting image can be larger because the actual function text is larger.
Below I've added a minimal example
The code below is enough to see the behavior I'm talking about. Please don't be thrown off by the volatile declaration and use of global variables, both of which are questionable in real code. Here they ensure the creation of two data sections when -fdata-sections is used.
static volatile int head;
static volatile int tail;
int queue_empty(void)
{
return head == tail;
}
The version of GCC used for this test is:
gcc version 6.1.1 20160526 (Arch Repository)
First, without -fdata-sections we get the following.
> arm-none-eabi-gcc -march=armv6-m \
-mcpu=cortex-m0 \
-mthumb \
-Os \
-c \
-o test.o \
test.c
> arm-none-eabi-objdump -dr test.o
00000000 <queue_empty>:
0: 4b03 ldr r3, [pc, #12] ; (10 <queue_empty+0x10>)
2: 6818 ldr r0, [r3, #0]
4: 685b ldr r3, [r3, #4]
6: 1ac0 subs r0, r0, r3
8: 4243 negs r3, r0
a: 4158 adcs r0, r3
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
10: R_ARM_ABS32 .bss
> arm-none-eabi-nm -S test.o
00000000 00000004 b head
00000000 00000014 T queue_empty
00000004 00000004 b tail
From arm-none-eabi-nm we see that queue_empty is 20 bytes long (14 hex), and the arm-none-eabi-objdump output shows that there is a single relocation word at the end of the function, it's the address of the BSS section (the section for uninitialized data). The first instruction in the function loads that value (the address of the BSS) into r3. The next two instructions load relative to r3, offsetting by 0 and 4 bytes respectively. These two loads are the loads of the values of head and tail. We can see those offsets in the first column of the output from arm-none-eabi-nm. The nop at the end of the function is to word align the address of the literal pool.
Next we'll see what happens when -fdata-sections is added.
> arm-none-eabi-gcc -march=armv6-m \
-mcpu=cortex-m0 \
-mthumb \
-Os \
-fdata-sections \
-c \
-o test.o \
test.c
> arm-none-eabi-objdump -dr test.o
00000000 <queue_empty>:
0: 4b03 ldr r3, [pc, #12] ; (10 <queue_empty+0x10>)
2: 6818 ldr r0, [r3, #0]
4: 4b03 ldr r3, [pc, #12] ; (14 <queue_empty+0x14>)
6: 681b ldr r3, [r3, #0]
8: 1ac0 subs r0, r0, r3
a: 4243 negs r3, r0
c: 4158 adcs r0, r3
e: 4770 bx lr
...
10: R_ARM_ABS32 .bss.head
14: R_ARM_ABS32 .bss.tail
> arm-none-eabi-nm -S test.o
00000000 00000004 b head
00000000 00000018 T queue_empty
00000000 00000004 b tail
Immediately we see that the length of queue_empty has increased by four bytes to 24 bytes (18 hex), and that there are now two relocations to be done in queue_empty's literal pool. These relocations correspond to the addresses of the two BSS sections that were created, one for each global variable. There need to be two addresses here because the compiler can't know the relative position that the linker will end up putting the two sections in. Looking at the instructions at the beginning of queue_empty, we see that there is an extra load, the compiler has to generate separate load pairs to get the address of the section and then the value of the variable in that section. The extra instruction in this version of queue_empty doesn't make the body of the function longer, it just takes the spot that was previously a nop, but that won't be the case in general.
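One way to sidestep the extra literal pool entries, if the variables really belong together, is to put them in one struct. A struct is a single object in a single section, so its members sit at fixed offsets from one address even with -fdata-sections. The sketch below is my own variation on the test file, with names I made up, and I have not measured it with the GCC version used above.

```c
/* Both fields live in one object, so one base address can reach them. */
struct queue_state {
    volatile int head;
    volatile int tail;
};

static struct queue_state queue;

int queue_empty(void)
{
    return queue.head == queue.tail;
}

/* Small helper so the state can be changed when trying this out. */
void queue_advance_head(void)
{
    queue.head++;
}
```

The trade-off is that the linker can now only keep or discard the whole struct, not the individual fields.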
When using those compiler options, you can add the linker option -Wl,--gc-sections that will remove all unused code.
You can use -ffunction-sections and -fdata-sections on static libraries, which will increase the size of the static library, as each function and global data variable will be put in a separate section.
And then use -Wl,--gc-sections on the program linking with this static library, which will remove unused sections.
Thus, the final binary will be smaller than without those flags.
Be careful though, as -Wl,--gc-sections can break things.
I get better results by adding an extra step and building an .a archive:
first, gcc and g++ are used with -ffunction-sections -fdata-sections flags
then, all .o objects are put into an .a archive with ar rcs file.a *.o
finally, the linker is called with -Wl,-gc-sections,-u,main options
for all, optimisation is set to -Os.
I tried it a while back, and looking at the results, it seems the size increase comes from the ordering of objects with different alignment. Normally the linker sorts objects to keep the padding between them small, but it looks like that only works within a section, not across the individual sections. So you often get extra padding between the data sections for each function, increasing the overall space.
For a static lib with -Wl,-gc-sections, the removal of unused sections will most likely more than make up for the small increase though.
