Where is __aeabi_fmul being linked from? - c

I've been running code on the ARM M0+ core and I see that the vast majority of my time is spent in floating point calculations. So I am experimenting with a custom floating point calculation function for use in very low power applications.
I've been using ARM GCC for bare metal compile on an M0+ (without a hard FPU). I see that floating point multiplication gets linked to __aeabi_fmul and then linked to generate the final ELF file.
My questions are as follows:
Where is __aeabi_fmul defined? Is it in a pre-compiled library that comes with GCC?
Is it possible to change this definition in some way? Maybe have a pre-compiled version of my_fp_mul instead and link to that instead of __aeabi_fmul?
I understand that the second part needs me to mess with the compiler. I've been looking into Clang/LLVM to do this, since the general consensus seems to be that it's easier to modify than GCC. I'm just trying to see if this is even possible or if I'm barking up entirely the wrong tree here.
thank you

It is part of gcc, in the gcc library (libgcc). Download the gcc sources and search for those functions and you will find them. They are soft-float routines and are hand tuned, so you are unlikely to do a significantly better job, but knock yourself out. I am not sure why you would do any floating point on an MCU like that, but thankfully the language and the tools allow it, although it can consume a lot of flash and execution time. (A possible compromise is to avoid float variables and do the math yourself in fixed point.)
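The fixed-point compromise mentioned above could look roughly like the following. This is only a sketch: the Q16.16 format and the names q16_t, q16_from_int, q16_add and q16_mul are my own choices for illustration, not anything the toolchain provides.

```c
#include <stdint.h>

/* Q16.16 fixed point: the real value is raw / 65536.
   All names here are illustrative, not from any library. */
typedef int32_t q16_t;

#define Q16_ONE ((q16_t)1 << 16)

/* Convert a small integer to Q16.16 (no overflow checking). */
static inline q16_t q16_from_int(int32_t x)
{
    return (q16_t)(x * Q16_ONE);
}

/* Addition needs no rescaling because both operands share the same scale. */
static inline q16_t q16_add(q16_t a, q16_t b)
{
    return a + b;
}

/* Multiply through a 64-bit intermediate, then drop the 16 extra
   fraction bits. The result is truncated, not rounded. */
static inline q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * (int64_t)b) >> 16);
}
```

Note that on a core without a long multiply instruction the 64-bit product is itself likely to become a library call, so it is worth checking the disassembly before assuming this is cheaper than the float routines.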
If you use gcc to link, then gcc knows where the libraries are and will pull them in automatically. If you use ld to link (using gcc just as a compiler, not as the caller of everything in the toolchain), then ld does not know where to find the libraries, and you can simply add your own object on the command line. This is the simplest way.
You can take the as-is gnu source for a particular function and add it to your project then modify it or just completely replace it with your own function.
Naturally you can go into the compiler sources, rename things, and rebuild the compiler. I am not sure how much work you want to do here; replacing the floating point routines without mistakes is already a large task. As mentioned in comments, I would leave the compiler alone and just work with it (leave the names the same and link with ld).
start.s
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.word hang
.thumb_func
reset:
bl notmain
.thumb_func
hang: b .
so.c
float notmain ( float a, float b )
{
return(a+b);
}
memmap
MEMORY
{
rom : ORIGIN = 0x00000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > rom
}
build
arm-none-eabi-as start.s -o start.o
arm-none-eabi-gcc -Xlinker -T -Xlinker memmap -nostdlib -nostartfiles -ffreestanding -mthumb start.o so.c -o so.elf -lgcc
arm-none-eabi-objdump -D so.elf
It doesn't complain, but it makes a perfectly broken binary:
20000048 <__addsf3>:
20000048: e1b02080 lsls r2, r0, #1
2000004c: 11b03081 lslsne r3, r1, #1
20000050: 11320003 teqne r2, r3
20000054: 11f0cc42 mvnsne r12, r2, asr #24
20000058: 11f0cc43 mvnsne r12, r3, asr #24
2000005c: 0a000047 beq 20000180 <__addsf3+0x138>
20000060: e1a02c22 lsr r2, r2, #24
20000064: e0723c23 rsbs r3, r2, r3, lsr #24
20000068: c0822003 addgt r2, r2, r3
2000006c: c0201001 eorgt r1, r0, r1
20000070: c0210000 eorgt r0, r1, r0
Those are ARM instructions, not Thumb. Examining what the linker was passed:
0:[/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/../../../../arm-none-eabi/bin/ld]
1:[-plugin]
2:[/opt/gnuarm/libexec/gcc/arm-none-eabi/7.1.0/liblto_plugin.so]
3:[-plugin-opt=/opt/gnuarm/libexec/gcc/arm-none-eabi/7.1.0/lto-wrapper]
4:[-plugin-opt=-fresolution=/tmp/ccSyISCJ.res]
5:[-X]
6:[-o]
7:[so.elf]
8:[-L/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/thumb]
9:[-L/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0]
10:[-L/opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/../../../../arm-none-eabi/lib]
11:[-T]
12:[memmap]
13:[start.o]
14:[/tmp/ccrdRU2s.o]
15:[-lgcc]
the other approach
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-ld -T memmap start.o so.o /opt/gnuarm/lib/gcc/arm-none-eabi/7.1.0/thumb/libgcc.a -o so.elf
but this is still broken
20000038 <__addsf3>:
20000038: e1b02080 lsls r2, r0, #1
2000003c: 11b03081 lslsne r3, r1, #1
20000040: 11320003 teqne r2, r3
20000044: 11f0cc42 mvnsne r12, r2, asr #24
20000048: 11f0cc43 mvnsne r12, r3, asr #24
2000004c: 0a000047 beq 20000170 <__addsf3+0x138>
20000050: e1a02c22 lsr r2, r2, #24
20000054: e0723c23 rsbs r3, r2, r3, lsr #24
I have not done the things I need to do to get the right library; I have to run and will re-edit this later...
But my proposed solution is:
.thumb_func
.globl __aeabi_fadd
__aeabi_fadd:
bx lr
I added to start.s for demonstration purposes
arm-none-eabi-as start.s -o start.o
arm-none-eabi-ld -T memmap start.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf
Disassembly of section .text:
20000000 <_start>:
20000000: 20001000 andcs r1, r0, r0
20000004: 20000015 andcs r0, r0, r5, lsl r0
20000008: 20000019 andcs r0, r0, r9, lsl r0
2000000c: 20000019 andcs r0, r0, r9, lsl r0
20000010: 20000019 andcs r0, r0, r9, lsl r0
20000014 <reset>:
20000014: f000 f802 bl 2000001c <notmain>
20000018 <hang>:
20000018: e7fe b.n 20000018 <hang>
2000001a <__aeabi_fadd>:
2000001a: 4770 bx lr
2000001c <notmain>:
2000001c: b510 push {r4, lr}
2000001e: f7ff fffc bl 2000001a <__aeabi_fadd>
20000022: bc10 pop {r4}
20000024: bc02 pop {r1}
20000026: 4708 bx r1
Then fill in whatever you want. Clearly this is not a real program: it breaks many rules, there are no numbers being passed in, etc.
But the compiler generated __aeabi_fadd and I supplied an __aeabi_fadd and it was happy.
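To give an idea of what "fill in whatever you want" might mean for a multiply, here is a deliberately simplified C sketch. The name my_fp_mul comes from the question; everything else is an assumption made for illustration. It only handles zero and normal finite inputs, treats denormals as zero, ignores NaN and infinity inputs, and truncates instead of rounding, so it is not a drop-in replacement for the libgcc routine.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret float bits without violating aliasing rules. */
static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Simplified single-precision multiply (illustrative only):
   zero/denormal inputs give signed zero, result is truncated. */
float my_fp_mul(float a, float b)
{
    uint32_t ua = f2u(a), ub = f2u(b);
    uint32_t sign = (ua ^ ub) & 0x80000000u;
    int32_t ea = (int32_t)((ua >> 23) & 0xFF);
    int32_t eb = (int32_t)((ub >> 23) & 0xFF);

    if (ea == 0 || eb == 0)
        return u2f(sign);                       /* zero or denormal input */

    /* Restore the implicit leading 1 of each 24-bit significand. */
    uint32_t ma = (ua & 0x007FFFFFu) | 0x00800000u;
    uint32_t mb = (ub & 0x007FFFFFu) | 0x00800000u;

    uint64_t prod = (uint64_t)ma * mb;          /* in [2^46, 2^48) */
    int32_t e = ea + eb - 127;                  /* remove one bias */

    if (prod & (1ULL << 47)) {                  /* product in [2, 4) */
        prod >>= 24;
        e += 1;
    } else {                                    /* product in [1, 2) */
        prod >>= 23;
    }

    if (e <= 0)
        return u2f(sign);                       /* underflow to zero */
    if (e >= 255)
        return u2f(sign | 0x7F800000u);         /* overflow to infinity */

    return u2f(sign | ((uint32_t)e << 23) | ((uint32_t)prod & 0x007FFFFFu));
}
```

To have the compiler's calls land on it, the same body would be given the name __aeabi_fmul and linked in with ld as shown above. The 64-bit product here would probably pull in another helper on an M0+, so a real version would want to split the multiply by hand.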
What I have done in the past, since I build my own gnu toolchain anyway, is go in and put a syntax error in the file of interest and do the build. The long command line used to build that item is then on the screen when it fails. Isolate the function of interest, use the long command line for gcc as a guide, and tweak and tune as desired. You get there faster than by trying to figure out all the defines in the code on your own.

Related

qemu-arm with Cortex-M4 on Linux

I am using qemu-arm and the ARM Workbench IDE to run/profile an ARM binary which was built with armcc/armlink (an .axf file, program written in C). This works fine with Cortex-A9 and ARM926/ARM5TE. However, whatever I tried, it doesn't work when the binary is built for Cortex-M4. Both the simulator and qemu-arm hang when M4 is selected as CPU.
I know that this processor requires some additional startup code, but I could not find any comprehensive tutorial on how to get it running. Does anyone know how to do this? I have a quite big project with one main function, but it would already help if a "hello world" or some simple program which takes arguments would run.
Here is the command line I am using with Cortex-A9:
qemu-system-arm -machine versatileab -cpu cortex-a9 -nographic -monitor null -semihosting -append 'some program arguments' -kernel program.axf
I do not know how to do it with the versatilepb, it did not "just work", but this does work:
flash.s
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
notmain.c
void PUT32 ( unsigned int, unsigned int );
#define UART0BASE 0x4000C000
int notmain ( void )
{
unsigned int rx;
for(rx=0;rx<8;rx++)
{
PUT32(UART0BASE+0x00,0x30+(rx&7));
}
return(0);
}
flash.ld
ENTRY(_start)
MEMORY
{
rom : ORIGIN = 0x00000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
(I am told that the entry point being a thumb function address is critical; YMMV.)
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m3 -mthumb -c notmain.c -o notmain.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy -O binary notmain.elf notmain.bin
check the vector table, etc.
00000000 <_start>:
0: 20001000
4: 0000000d
8: 00000013
0000000c <reset>:
c: f000 f804 bl 18 <notmain>
10: e7ff b.n 12 <hang>
00000012 <hang>:
12: e7fe b.n 12 <hang>
Looks good.
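The "check the vector table" step can also be done mechanically. The following is a small host-side sketch, with a made-up function name and made-up RAM bounds as parameters, that applies the same two checks I did by eye: the first word should be a stack address in ram and the handler addresses should be odd (thumb).

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if the table looks plausible: word 0 (initial stack pointer)
   lies in [ram_lo, ram_hi] and every following handler address has
   bit 0 set. Reserved entries, which are often zero in a full table,
   are not handled by this simple check. */
int vectors_look_ok(const uint32_t *table, size_t words,
                    uint32_t ram_lo, uint32_t ram_hi)
{
    if (words < 2)
        return 0;
    if (table[0] < ram_lo || table[0] > ram_hi)
        return 0;
    for (size_t i = 1; i < words; i++) {
        if ((table[i] & 1u) == 0)
            return 0;                /* even address: not a thumb entry */
    }
    return 1;
}
```

Fed the three words from the listing above (0x20001000, 0x0000000d, 0x00000013) with ram bounds 0x20000000 to 0x20001000, this should report the table as fine.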
And run it
qemu-system-arm -M lm3s811evb -m 8K -nographic -kernel notmain.bin
01234567
Then ctrl-a then x to exit
QEMU: Terminated
-cpu cortex-m4 works as well as one would expect. Would have to try to find things different between the m3 and m4 that might show up in a sim like this and go from there.
After Luminary Micro (acquired by ti a while ago now) I do not think anyone else put the effort in for a machine. But as already discussed in at least one question at this site, you can run the cores (an exercise for the reader).
For versatilepb
int notmain ( void )
{
unsigned int ra;
for(ra=0;;ra++)
{
ra&=7;
PUT32(0x101f1000,0x30+ra);
}
return(0);
}
qemu-system-arm -machine versatileab -cpu cortex-m4 -nographic -monitor null -kernel notmain.elf
qemu-system-arm: This board cannot be used with Cortex-M CPUs
You can't arbitrarily plug different CPU types into an Arm board model. If you try it then the resulting system may work by luck, or may crash, or have odd behaviour; in some cases the -cpu option will just be ignored. This is because the CPU integration with the board matters: things like interrupt controllers are part of the board, not the CPU, but not all CPUs will work with all interrupt controllers. Often QEMU is not as good as it could be about detecting and reporting errors for user options that aren't valid.
In this case you're probably using an older QEMU: newer ones will correctly report:
qemu-system-arm: This board cannot be used with Cortex-M CPUs
if you try to use '-machine versatilepb' with '-cpu cortex-m4'. Older ones would either crash or just misbehave.
Generally the best thing is to use the CPU type that the board has by default (ie don't specify a -cpu option), for every board type except the "virt" board. If you want to write code for a Cortex-M4, you should look for a board type that has a Cortex-M4. The mps2-an386 is probably a good option. (If your QEMU doesn't have that board type, upgrade to a newer one: there have been a lot of M-profile emulation bug fixes anyway that you'll want to have.)

How to use the enhanced multiplier instructions of ARMv5TE instruction set

I'm using an ARM966E-S RISC-CPU and was wondering how to use the apparently available instruction set extensions for better DSP performance, e. g. an enhanced multiplier instruction.
I've read in the technical reference manual that these instruction set extensions are available but I don't know how to use/activate them.
Can anybody help?
Thanks in advance!
Why not just try it? Or read the manual for your toolchain, for example with gcc
so.s
ldrd r0,[r2]
ldr r2,[r2]
test
arm-none-eabi-as so.s -o so.o
arm-none-eabi-as -march=armv5t so.s -o so.o
so.s: Assembler messages:
so.s:3: Error: selected processor does not support `ldrd r0,[r2]' in ARM mode
arm-none-eabi-as -march=armv5te so.s -o so.o
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <.text>:
0: e1c200d0 ldrd r0, [r2]
4: e5922000 ldr r2, [r2]
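From C, the usual way to reach the enhanced multiplier is to write plain 16-bit multiply-accumulate code and let the compiler pick the instruction. The function below is only a sketch with a name of my choosing; built with -march=armv5te and optimization, gcc may select something like smlabb for the loop body, but that is an assumption to confirm with objdump rather than a guarantee.

```c
#include <stdint.h>
#include <stddef.h>

/* 16x16 -> 32 multiply-accumulate over two arrays. The explicit
   widening casts document that each product is a signed 32-bit value. */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Comparing the disassembly built with -march=armv5t against -march=armv5te, as done for ldrd above, shows whether the extension is actually being used.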

Programming STM32F4x WITHOUT IDE on Debian

This is my first question on this website, and I'm not sure about my English..
I want to know if there is a way to program the Nucleo STM32F446RE (via USB, not via JTAG) WITHOUT using any IDE.
For training purposes, I want to program with only a text editor (I use kate), Makefiles and the command line.
What I already found/installed:
gcc-arm-none-eabi (6-2017-q2-update)
It contains, I think, all we need to compile (but I don't think there is an asm compiler in there).
There is example of code in C, and makefiles (that I don't totally understand). It seems to compile well (I tried the "minimum" example).
Here is the example I used:
#ifndef __NO_SYSTEM_INIT
void SystemInit()
{}
#endif
void main()
{
for (;;);
}
And here is the Makefile:
# Selecting Core
CORTEX_M=4
# Use newlib-nano. To disable it, specify USE_NANO=
USE_NANO=--specs=nano.specs
# Use semihosting or not
USE_SEMIHOST=--specs=rdimon.specs
USE_NOHOST=--specs=nosys.specs
CORE=CM$(CORTEX_M)
BASE=../..
# Compiler & Linker
CC=arm-none-eabi-gcc
CXX=arm-none-eabi-g++
# Options for specific architecture
ARCH_FLAGS=-mthumb -mcpu=cortex-m$(CORTEX_M)
# Startup code
STARTUP=$(BASE)/startup/startup_ARM$(CORE).S
# -Os -flto -ffunction-sections -fdata-sections to compile for code size
CFLAGS=$(ARCH_FLAGS) $(STARTUP_DEFS) -Os -flto -ffunction-sections -fdata-sections
CXXFLAGS=$(CFLAGS)
# Link for code size
GC=-Wl,--gc-sections
# Create map file
MAP=-Wl,-Map=$(NAME).map
NAME=minimum
STARTUP_DEFS=-D__STARTUP_CLEAR_BSS -D__START=main
LDSCRIPTS=-L. -L$(BASE)/ldscripts -T nokeep.ld
LFLAGS=$(USE_NANO) $(USE_NOHOST) $(LDSCRIPTS) $(GC) $(MAP)
$(NAME)-$(CORE).axf: $(NAME).c $(STARTUP)
$(CC) $^ $(CFLAGS) $(LFLAGS) -o $@
clean:
rm -f $(NAME)*.axf *.map *.o
I modified it in order to set cortex-m4 instead of cortex-m0.
After running the make command I get minimum.map and minimum.axf files.
But I don't know how to load the object code in the device. ( and is it normal not to have a minimum.o file ? )
I would call something like this a minimal example with C code, the infinite loop is not necessary in this case, but is inspired by yours.
vectors.s
.thumb
.globl _start
_start:
.word 0x20002000
.word reset
.word done
.word done
.thumb_func
reset:
bl centry
b done
.thumb_func
done:
b done
so.c
void centry ( void )
{
for(;;) continue;
}
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
build
arm-none-eabi-as vectors.s -o vectors.o
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-ld -T flash.ld vectors.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
examine
08000000 <_start>:
8000000: 20002000 andcs r2, r0, r0
8000004: 08000011 stmdaeq r0, {r0, r4}
8000008: 08000017 stmdaeq r0, {r0, r1, r2, r4}
800000c: 08000017 stmdaeq r0, {r0, r1, r2, r4}
08000010 <reset>:
8000010: f000 f802 bl 8000018 <centry>
8000014: e7ff b.n 8000016 <done>
08000016 <done>:
8000016: e7fe b.n 8000016 <done>
08000018 <centry>:
8000018: e7fe b.n 8000018 <centry>
800001a: 46c0 nop ; (mov r8, r8)
Likely not required, but read the docs: folks use 0x08000000, though technically the boot address is 0x00000000. The stm32 family maps 0x08000000 to 0x00000000 based on the boot pins, as described in the documentation. Inspection needs to show that the vector table is the first thing in the image and that you have told the toolchain these are thumb addresses in the vector table (lsbit is set). I could have put the C entry function (main() is not required, that is just a convention) in the vector table as the reset handler. I have no .data or .bss initialization, so something like this does not allow the use of .data, nor can you assume .bss variables are zero; you have to write before you read. Adding more code to the bootstrap (and linker script) would allow for that.
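The extra bootstrap code for .data and .bss is small. Here is a sketch of the idea as a plain C function; the function name and its parameters are mine, and in a real project the five pointers would come from symbols you define in the linker script (which the flash.ld above does not do).

```c
#include <stdint.h>

/* Copy initialized data from its load address (flash) to its run
   address (ram), then zero .bss. All regions are assumed to be
   word aligned and a multiple of four bytes long. */
void init_ram(const uint32_t *data_load,
              uint32_t *data_start, uint32_t *data_end,
              uint32_t *bss_start, uint32_t *bss_end)
{
    while (data_start < data_end)
        *data_start++ = *data_load++;   /* .data: flash -> ram */

    while (bss_start < bss_end)
        *bss_start++ = 0;               /* .bss: must read as zero */
}
```

It would be called from the reset path before centry, and the linker script would need a .data section placed in ram but loaded from rom for the copy to have something to work on.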
arm-none-eabi-objcopy so.elf -O binary so.bin
Will create a binary that, depending on the tools you use, may be used to load the program. If this is a nucleo board you can copy that file to the virtual usb drive. Clearly this program won't show anything interesting. Using openocd or other SWD debugger software (if you have a nucleo board you don't need any other hardware) you can stop and restart the program to try to see it running.
You can read the documentation to see the addresses and how to program the peripherals.
thumb2 is just extensions to thumb, you can stick with traditional thumb or add cortex-m4 or armv7m to the command line (cpu/arch) to try to reduce the number of instructions but trade off for larger instructions.
There are no doubt tools out there, but it is fairly easy to write your own program to interface with the serial bootloader to download your program into the device.
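As a taste of what such a program involves, the sketch below builds two of the byte frames. It assumes the framing I remember from ST's USART bootloader application note (a command byte followed by its complement, and a four-byte big-endian address followed by the XOR of those bytes); verify it against the document for your part before relying on it. The function and constant names are invented for this example, and the serial port handling is left out entirely.

```c
#include <stdint.h>

/* Values as I understand the ST USART bootloader protocol; check the
   application note for your device. */
enum {
    BL_SYNC      = 0x7F,   /* first byte sent so the device detects baud rate */
    BL_ACK       = 0x79,
    BL_NACK      = 0x1F,
    BL_CMD_WRITE = 0x31    /* write memory */
};

/* Command frame: the command byte followed by its bitwise complement. */
void bl_cmd_frame(uint8_t cmd, uint8_t out[2])
{
    out[0] = cmd;
    out[1] = (uint8_t)~cmd;
}

/* Address frame: four address bytes, most significant first,
   followed by the XOR of those four bytes as a checksum. */
void bl_addr_frame(uint32_t addr, uint8_t out[5])
{
    out[0] = (uint8_t)(addr >> 24);
    out[1] = (uint8_t)(addr >> 16);
    out[2] = (uint8_t)(addr >> 8);
    out[3] = (uint8_t)addr;
    out[4] = (uint8_t)(out[0] ^ out[1] ^ out[2] ^ out[3]);
}
```

The host program would send each frame and wait for an ACK byte before moving on to the next one.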

ELLCC embedded LLVM compilation fails with certain asm instructions against Thumb2 Cortex-M0

Instructions which are known to be valid, successfully used in Gnu G++ are causing some errors here against a Freescale MKL16Z Cortex-M0+ Thumb2
The code:
/* setup the stack before we attempt anything else
skip stack setup if __SP_INIT is 0
assume sp is already setup. */
__asm (
"mov r0,%0\n\t"
"cmp r0,#0\n\t"
"beq skip_sp\n\t"
"mov sp,r0\n\t"
"sub sp,#4\n\t"
"mov r0,#0\n\t"
"mvn r0,r0\n\t"
"str r0,[sp,#0]\n\t"
"add sp,#4\n\t"
"skip_sp:\n\t"
::"r"(addr));
the compile command:
ecc -target thumb-linux-engeabi -mtune=cortex-m0plus -mcpu=cortex-m0plus -mthumb -O2 -fmessage-length=0 -fsigned-char -ffunction-sections -fdata-sections -Wall -Wconversion -Wpointer-arith -Wshadow -Wfloat-equal -g3 -I"[redacted]" -I"[redacted]" -I"[redacted]" -I"[redacted]" -std=c99 -MMD -MP -MF"Project_Settings/Startup_Code/startup.d" -MT"Project_Settings/Startup_Code/startup.o" -c -o "Project_Settings/Startup_Code/startup.o" "../Project_Settings/Startup_Code/startup.c"
../Project_Settings/Startup_Code/startup.c:209:17: error: instruction requires: arm-mode
"sub sp,#4\n\t"
^
<inline asm>:6:2: note: instantiated into assembly here
mov r0,#0
^
../Project_Settings/Startup_Code/startup.c:210:17: error: invalid instruction
"mov r0,#0\n\t"
^
<inline asm>:7:2: note: instantiated into assembly here
mvn r0,r0
^~~
2 errors generated.
make: *** [Project_Settings/Startup_Code/startup.o] Error 1
Tips appreciated! I wonder if I can rewrite the asm into something using simpler instructions… it doesn’t seem to like immediate value encoding? I’m not skilled with assembly.
I think that what the compiler tells is that the instruction does not exist in Thumb and only exists in ARM.
In Thumb, almost all data-processing instructions update the flags. It means that:
MOV r0, #0
Does not exist, but is instead:
MOVS r0, #0 ; Update NZCV flags
Replacing mov with movs and mvn with mvns got things compiling. The binary was 6 times larger than g++ for some unknown reason. It seems that LLVM does not yet have an accurate representation of available instructions on Cortex-M0+ devices.

Query on -ffunction-sections & -fdata-sections options of gcc

The GCC page says the following about the function sections and data sections options:
-ffunction-sections
-fdata-sections
Place each function or data item into its own section in the output file if the target supports arbitrary sections. The name of the function or the name of the data item determines the section's name in the output file.
Use these options on systems where the linker can perform optimizations to improve locality of reference in the instruction space. Most systems using the ELF object format and SPARC processors running Solaris 2 have linkers with such optimizations. AIX may have these optimizations in the future.
Only use these options when there are significant benefits from doing so. When you specify these options, the assembler and linker will create larger object and executable files and will also be slower. You will not be able to use gprof on all systems if you specify this option and you may have problems with debugging if you specify both this option and -g.
I was under the impression that these options will help in reducing the executable file size. Why does this page say that it will create larger executable files? Am I missing something?
Interestingly, using -fdata-sections can make the literal pools of your functions, and thus your functions themselves larger. I've noticed this on ARM in particular, but it's likely to be true elsewhere. The binary I was testing only grew by a quarter of a percent, but it did grow. Looking at the disassembly of the changed functions it was clear why.
If all of the BSS (or DATA) entries in your object file are allocated to a single section then the compiler can store the address of that section in the function's literal pool and generate loads with known offsets from that address in the function to access your data. But if you enable -fdata-sections it puts each piece of BSS (or DATA) data into its own section, and since it doesn't know which of these sections might be garbage collected later, or what order the linker will place all of these sections into the final executable image, it can no longer load data using offsets from a single address. So instead, it has to allocate an entry in the literal pool per used data item, and once the linker has figured out what is going into the final image and where, then it can go and fix up these literal pool entries with the actual address of the data.
So yes, even with -Wl,--gc-sections the resulting image can be larger because the actual function text is larger.
Below I've added a minimal example
The code below is enough to see the behavior I'm talking about. Please don't be thrown off by the volatile declaration and use of global variables, both of which are questionable in real code. Here they ensure the creation of two data sections when -fdata-sections is used.
static volatile int head;
static volatile int tail;
int queue_empty(void)
{
return head == tail;
}
The version of GCC used for this test is:
gcc version 6.1.1 20160526 (Arch Repository)
First, without -fdata-sections we get the following.
> arm-none-eabi-gcc -march=armv6-m \
-mcpu=cortex-m0 \
-mthumb \
-Os \
-c \
-o test.o \
test.c
> arm-none-eabi-objdump -dr test.o
00000000 <queue_empty>:
0: 4b03 ldr r3, [pc, #12] ; (10 <queue_empty+0x10>)
2: 6818 ldr r0, [r3, #0]
4: 685b ldr r3, [r3, #4]
6: 1ac0 subs r0, r0, r3
8: 4243 negs r3, r0
a: 4158 adcs r0, r3
c: 4770 bx lr
e: 46c0 nop ; (mov r8, r8)
10: 00000000 .word 0x00000000
10: R_ARM_ABS32 .bss
> arm-none-eabi-nm -S test.o
00000000 00000004 b head
00000000 00000014 T queue_empty
00000004 00000004 b tail
From arm-none-eabi-nm we see that queue_empty is 20 bytes long (14 hex), and the arm-none-eabi-objdump output shows that there is a single relocation word at the end of the function, it's the address of the BSS section (the section for uninitialized data). The first instruction in the function loads that value (the address of the BSS) into r3. The next two instructions load relative to r3, offsetting by 0 and 4 bytes respectively. These two loads are the loads of the values of head and tail. We can see those offsets in the first column of the output from arm-none-eabi-nm. The nop at the end of the function is to word align the address of the literal pool.
Next we'll see what happens when -fdata-sections is added.
arm-none-eabi-gcc -march=armv6-m \
-mcpu=cortex-m0 \
-mthumb \
-Os \
-fdata-sections \
-c \
-o test.o \
test.c
arm-none-eabi-objdump -dr test.o
00000000 <queue_empty>:
0: 4b03 ldr r3, [pc, #12] ; (10 <queue_empty+0x10>)
2: 6818 ldr r0, [r3, #0]
4: 4b03 ldr r3, [pc, #12] ; (14 <queue_empty+0x14>)
6: 681b ldr r3, [r3, #0]
8: 1ac0 subs r0, r0, r3
a: 4243 negs r3, r0
c: 4158 adcs r0, r3
e: 4770 bx lr
...
10: R_ARM_ABS32 .bss.head
14: R_ARM_ABS32 .bss.tail
arm-none-eabi-nm -S test.o
00000000 00000004 b head
00000000 00000018 T queue_empty
00000000 00000004 b tail
Immediately we see that the length of queue_empty has increased by four bytes to 24 bytes (18 hex), and that there are now two relocations to be done in queue_empty's literal pool. These relocations correspond to the addresses of the two BSS sections that were created, one for each global variable. There need to be two addresses here because the compiler can't know the relative position that the linker will end up putting the two sections in. Looking at the instructions at the beginning of queue_empty, we see that there is an extra load: the compiler has to generate separate load pairs to get the address of the section and then the value of the variable in that section. The extra instruction in this version of queue_empty doesn't make the body of the function longer, it just takes the spot that was previously a nop, but that won't be the case in general.
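One way to get the single base address back while keeping -fdata-sections is to group related variables into one object, since a struct always lands in one section with fixed member offsets. The variant below is a sketch of that idea and not something I measured for this answer; I would expect a single literal pool entry again, but check it with objdump -dr on your own compiler.

```c
/* Both fields live in one object, so the compiler knows their
   relative offsets no matter where the linker places the section. */
static volatile struct {
    int head;
    int tail;
} queue;

int queue_empty(void)
{
    return queue.head == queue.tail;
}
```

The trade-off is that the linker can now only keep or discard head and tail together, which is usually what you want for state that is always used as a pair.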
When using those compiler options, you can add the linker option -Wl,--gc-sections that will remove all unused code.
You can use -ffunction-sections and -fdata-sections on static libraries, which will increase the size of the static library, as each function and global data variable will be put in a separate section.
And then use -Wl,--gc-sections on the program linking with this static library, which will remove unused sections.
Thus, the final binary will be smaller than without those flags.
Be careful though, as -Wl,--gc-sections can break things.
I get better results adding an additional step and building an .a archive:
first, gcc and g++ are used with -ffunction-sections -fdata-sections flags
then, all .o objects are put into an .a archive with ar rcs file.a *.o
finally, the linker is called with -Wl,-gc-sections,-u,main options
for all, optimisation is set to -Os.
I tried it a while back, and looking at the results it seems the size increase comes from the order of objects with different alignment. Normally the linker sorts objects to keep the padding between them small, but it looks like that only works within a section, not across the individual sections. So you often get extra padding between the data sections for each function, increasing the overall space.
For a static lib with -Wl,-gc-sections, the removal of unused sections will most likely more than make up for the small increase though.
