I am using a Yagarto toolchain on Windows to compile a codebase of about 100K lines of code.
We have two development PCs, but they each build slightly different binaries despite having the same toolchain and building the same source code.
I have checked using MD5 that we have the same compiler binaries and the same system headers, and that we are compiling the same source with the same command line being passed to GCC, yet there are subtle differences.
Out of the 81 object files in our codebase, 77 compile exactly the same, and four have minor differences. There is no functional difference, but since we are going to be supporting the compiled binaries, I would like to get to the bottom of this issue.
The "arm-elf-gcc.exe" is dated Jul 16 2006.
The output of "arm-elf-gcc -v" is:
Using built-in specs.
Target: arm-elf
Configured with: ../gcc-4.1.1/configure --target=arm-elf --prefix=/home/yagarto/yagarto --disable-nls --disable-shared --disable-threads --with-gcc --with-gnu-ld --with-gnu-as --with-stabs --enable-languages=c,c++ --enable-interwork --enable-multilib --with-newlib --disable-libssp --disable-libstdcxx-pch --disable-libmudflap --enable-win32-registry=yagarto -v
Thread model: single
gcc version 4.1.1
Here is an example from the listing files showing the generated code that differs:
On one PC:
.LCB1356:
mov r7, #0
mov r5, #2
str r7, [sp, #16]
str r7, [sp, #20]
str r7, [sp, #24]
str r7, [sp, #28]
str r7, [sp, #40]
.L231:
On the other PC:
.LCB1356:
mov r7, #0
mov r5, #2
str r7, [sp, #16]
str r7, [sp, #20]
str r7, [sp, #40]
str r7, [sp, #24]
str r7, [sp, #28]
.L231:
In the two cases, just the order of the variables in the stack frame differs; all the code is the same except that the variables are laid out in a different order. (diff on the listing files just shows various other lines where #40 is swapped with #28, and so on.)
This change is obviously harmless (although I would like to know why it happens), but in two of the other object files the text segment is actually 4 bytes larger in one version, and, as well as the variables being in a different order in the stack frame, a couple of instructions differ.
One PC is an Intel Core 2 Duo running Windows 2000, and the other is an AMD X4 running Windows 7. Each PC reliably reproduces its own build, but one PC's build differs from the other's.
Is it possible that GCC optimizes differently depending on which CPU is actually performing the build (not the target CPU)? Or what else might cause this discrepancy?
I am building a little OS for a Cortex-M4 CPU which is able to receive compiled binaries over UART and schedule them dynamically. I want to use that feature to craft a test suite which uploads test programs that can directly call OS functions, such as memory allocation, without doing an SVC. Therefore I need to cast the fixed addresses of those OS routines to function pointers. However, casting the raw memory addresses results in wrong / non-Thumb instruction code (a BL is needed instead of the BLX), resulting in HardFaults.
void (*functionPtr_addr)(void);
functionPtr_addr = (void (*)()) (0x0800084C);
This is the assembly generated when calling this function:
8000838: 4b03 ldr r3, [pc, #12] ; (8000848 <idle+0x14>)
800083a: 681b ldr r3, [r3, #0]
800083c: 4798 blx r3
Is there a way to force the BL instruction in such a case? It works with inline assembly, and I could write macros, but it would be much cleaner to do it this way.
The code gets compiled and linked, among other things, with
-mcpu=cortex-m4 -mthumb.
Toolchain:
gcc version 12.2.0 (Arm GNU Toolchain 12.2.MPACBTI-Bet1 (Build arm-12-mpacbti.16))
The bl instruction is limited in range. The compiler does not know where your code will be placed, so it cannot know whether bl can be used.
resulting in HardFaults.
The address passed to blx has to be odd on Cortex-M4 MCUs to execute the code in Thumb mode. Your address is even, so the MCU tries to execute ARM code, which is not supported by this core.
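Building on that, a minimal sketch of the fix is to set bit 0 of the address before casting (the address itself comes from the question; that it really is the routine's entry point is assumed):

/* Set the Thumb bit (bit 0) so the generated blx r3 stays in Thumb state. */
void (*functionPtr_addr)(void);
functionPtr_addr = (void (*)(void))(0x0800084C | 1u);
functionPtr_addr(); /* no longer HardFaults: the target address is odd */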
I'm using Clang++ to compile for a Cortex-M0+ target, and in moving from version 14 to version 15 I've found a difference in the code generated for guard variables for local statics.
So, for example:
int main()
{
static knl::QueueN<uint32_t, 8> valQueue;
...
}
Clang-14 generates the following:
ldr r0, .LCPI0_4
ldrb r0, [r0]
dmb sy
lsls r0, r0, #31
beq .LBB0_8
Clang-15 now generates:
ldr r0, .LCPI0_4
movs r1, #2
bl __atomic_load_1
lsls r0, r0, #31
beq .LBB0_8
Why the change? Was the Clang 14 code incorrect?
EDITED TO ADD:
Note that an important consequence of this is that the second case actually requires an implementation of __atomic_load_1 to be provided from somewhere external to the compiler (e.g. -latomic), whereas the first doesn't.
EDITED TO ADD:
See https://github.com/llvm/llvm-project/issues/58184 for the LLVM devs' response to this.
Neither one is wrong. It's just that in the first version, the code to do the atomic load is inlined, and in the second version it's called as a library function instead. If you look at the code within __atomic_load_1 you will probably find it executes the exact same instructions, or equivalent ones.
Each way has pros and cons. The inline version avoids the overhead of a function call, while the library version makes it possible to select code at runtime that is best optimized for the features of the actual runtime CPU.
The difference could be a conscious design change between clang versions, or a difference in the code gen and optimization options you used, or different configuration options when your clang installation was built. Someone else might know more details about what controls this. But it isn't anything to worry about as far as proper behavior of your code.
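As an illustration of that last point, on a single-core Cortex-M0+ a minimal __atomic_load_1 could be little more than a plain byte load. This is a hedged sketch under that single-core assumption, not libatomic's actual code:

#include <stdint.h>

/* The memory-order argument (r1 in the call above) is accepted for ABI
   compatibility; on a single core a plain volatile load is sufficient. */
uint8_t __atomic_load_1(const volatile void *ptr, int memorder)
{
    (void)memorder;
    return *(const volatile uint8_t *)ptr;
}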
Trying to learn more about ARM chips, and after a successful blinky in assembly I now want to mix C and assembly functions. However, any C function I call causes a HardFault. I think I'm missing something obvious.
I compile using gcc and these flags
-c -g -ggdb -Wall --specs=nosys.specs
I link against the following libc.a / libgcc.a libraries:
-L/usr/local/gnu-arm/arm-none-eabi/lib/thumb/v7e-m+fp/softfp -lc -L/usr/local/gnu-arm/lib/gcc/arm-none-eabi/9.2.1/thumb/v7e-m+fp/softfp
From the objdump, this is where the hardfault happens:
80004d4: d3fb bcc.n 80004ce <FillZerobss>
80004d6: f7ff ff09 bl 80002ec <SystemInit>
The chip raises a HardFault when the above bl is executed.
Here are the first lines of the SystemInit function:
080002ec <SystemInit>:
80002ec: e52db004 push {fp} ; (str fp, [sp, #-4]!)
80002f0: e28db000 add fp, sp, #0, 0
80002f4: e59f3014 ldr r3, [pc, #20] ; 8000310 <SystemInit+0x24>
80002f8: e3a02302 mov r2, #134217728 ; 0x8000000
80002fc: e5832008 str r2, [r3, #8]
8000300: e1a00000 nop ; (mov r0, r0)
Instead of 080002ec I end up at:
08000298 <HardFault_Handler>
I think I'm missing something quite obvious but can't see it. Any help or pointers would be appreciated.
The comments show that the question was resolved in the meantime. This answer collects a digest of the useful comments in order to give the question a brief answer:
Idea (old_timer):
based on some clues, my guess is you are running on a Cortex-M,
which cannot run ARM instructions, only Thumb instructions,
and which Thumb instructions are supported depends on the chip and core.
What chip/core is this?
OP (user13424266):
Thanks for everyone's help and for pointing me in the right direction. I added -mthumb -mthumb-interwork to GCC and it now works as expected!
Confirmation (Martin Rosenau):
@fuz I just tried it out: the GNU linker does not replace the bl with a blx.
However, STM32 CPUs typically have Cortex-M cores, which support neither
non-Thumb code nor the interworking blx <label> form of the instruction.
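For reference, a compile line with that fix applied might look like the following; the -mcpu value is an assumption, since the question never names the chip (though the v7e-m+fp library paths suggest a Cortex-M4/M7 class core), and the file name is hypothetical:

arm-none-eabi-gcc -c -g -ggdb -Wall -mthumb -mthumb-interwork -mcpu=cortex-m4 --specs=nosys.specs main.c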
I'm trying to understand how bare-metal C applications work exactly. I wrote my own startup assembly code that calls __libc_init_array, and I saw it iterating over the preinit_array section and calling all the functions inside. As I understand it, gcc adds that section for some of its own initialization routines that need to run before main, but then comes the _init() function in the .init section.
Does gcc generate that function? Does it come from libc? Or do I have to provide one on my own? What are some good resources to learn these things?
What do symbols have to do with the platform? Is _init() generated by
gcc on one platform and not on another?
Yes, the startup and epilogue routines are left to the implementation, and gcc itself does not generate them.
The libc provides those symbols - https://github.com/bminor/newlib/blob/e0f24404b3fcfa2c332ae14c3934546c91be3f42/newlib/libc/misc/init.c
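In simplified form (glossing over details of the newlib source linked above), __libc_init_array amounts to the following; the section-boundary symbols are provided by the usual linker scripts:

#include <stddef.h>

extern void (*__preinit_array_start[])(void);
extern void (*__preinit_array_end[])(void);
extern void (*__init_array_start[])(void);
extern void (*__init_array_end[])(void);
extern void _init(void);

/* Run .preinit_array entries, then _init(), then .init_array entries. */
void __libc_init_array(void)
{
    size_t count, i;

    count = __preinit_array_end - __preinit_array_start;
    for (i = 0; i < count; i++)
        __preinit_array_start[i]();

    _init();

    count = __init_array_end - __init_array_start;
    for (i = 0; i < count; i++)
        __init_array_start[i]();
}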
Depending on your target hardware, the initialisation may be done in a completely different way.
Example STM32Fxxx startup:
.section .text.Reset_Handler
.weak Reset_Handler
.type Reset_Handler, %function
Reset_Handler:
ldr sp, =_estack /* Atollic update: set stack pointer */
/* Copy the data segment initializers from flash to SRAM */
movs r1, #0
b LoopCopyDataInit
CopyDataInit:
ldr r3, =_sidata
ldr r3, [r3, r1]
str r3, [r0, r1]
adds r1, r1, #4
LoopCopyDataInit:
ldr r0, =_sdata
ldr r3, =_edata
adds r2, r0, r1
cmp r2, r3
bcc CopyDataInit
ldr r2, =_sbss
b LoopFillZerobss
/* Zero fill the bss segment. */
FillZerobss:
movs r3, #0
str r3, [r2], #4
LoopFillZerobss:
ldr r3, = _ebss
cmp r2, r3
bcc FillZerobss
/* Call the clock system initialization function. */
bl SystemInit
/* Call static constructors */
bl __libc_init_array
/* Call the application's entry point.*/
bl main
As you can see, in this implementation two functions are called: SystemInit, for very low-level hardware initialisation, and __libc_init_array, which is the internal initialisation of the newlib library (nowadays the most commonly used libc in bare-metal projects).
A problem arises if you decide not to use the standard libraries and do not want to link any of them. Some toolchains provide weak functions with just a return statement; some do not. If you are getting linker errors, just comment out this call in the startup file or provide an empty function yourself.
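A hedged sketch of such an empty stand-in, so the startup code above still links when no standard library is pulled in:

/* Empty replacement for newlib's __libc_init_array: there are no
   constructors or init_array entries to run in this configuration. */
void __libc_init_array(void) { }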
I have an empty program in LLVM IR:
define i32 @main(i32 %argc, i8** %argv) nounwind {
entry:
ret i32 0
}
I'm cross-compiling it on Intel x86-64 Windows for ARM Linux using ELLCC, with the following command:
ecc++ hw.ll -o hw.o -target arm-linux-engeabihf
It completes without errors and generates an ELF binary.
When I take the binary to a Raspberry Pi Model B+ (running Raspbian), I get only the following error:
Illegal instruction
I don't know how to tell what's wrong from the disassembled code. I tried other ARM Linux targets but the behavior was the same. What's wrong?
The exact same file builds, links and runs fine for other targets like i386-linux-eng and x86_64-w64-mingw32 (the ones I could test on), again using the ELLCC toolchain.
Assuming the library and startup code isn't at fault, this is what the disassembly of main itself looks like:
.text:00010188 e24dd008 sub sp, sp, #8
.text:0001018c e3002000 movw r2, #0
.text:00010190 e58d0004 str r0, [sp, #4]
.text:00010194 e1a00002 mov r0, r2
.text:00010198 e58d1000 str r1, [sp]
.text:0001019c e28dd008 add sp, sp, #8
.text:000101a0 e12fff1e bx lr
I'd guess it's choking on the movw at 0x0001018c. The movw/movt encodings, which can handle full 16-bit immediate values, first appeared in the ARMv6T2 version of the architecture; the ARM1176 in the original Pi models predates that, supporting only original ARMv6*.
You need to tell the compiler to generate code appropriate to the thing you're running on. I don't know ELLCC, but I'd guess from this that it's fairly modern and up to date, and thus defaults to something newer like ARMv6T2 or ARMv7. Otherwise, it's akin to generating code for a Pentium and hoping it works on an 80486: you might be lucky, you might not. That said, there's no good reason it should have chosen that encoding in the first place; it's not as if 0 can't be encoded in a 'classic' mov instruction...
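Assuming ecc++ accepts the usual Clang-style code generation flags (an assumption on my part; I haven't used ELLCC), pinning the build to the original Pi's core would look something like:

ecc++ hw.ll -o hw.o -target arm-linux-engeabihf -mcpu=arm1176jzf-s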
The decadent option, however, would be to consider this a perfect excuse to replace the Pi with a Pi 2 - the Cortex-A7s in that are nice capable ARMv7 cores ;)
* Lies for clarity. I think 1176 might actually be v6K, but that's irrelevant here. I'm not sure if anything actually exists as plain ARMv6, and all the various architecture extensions are frankly a hideous mess