ARM Thumb mode issue on Cortex-A15

We are using a Cortex-A15 with kernel 3.8.
If I compile with
arm-gcc-4.7.3 test.c -o test_thumb -mthumb
then whether I set or unset CONFIG_ARM_THUMB in the kernel, my Thumb (user-space) binary always runs,
so I can't understand the behavior.

Ok, so, I can't see a good reason to do what you're attempting to do ... so I'll assume you are asking out of pure curiosity.
It is not possible (in the processor) to disable decoding of Thumb instructions or switching to Thumb state. The CONFIG_ARM_THUMB option is about making the use of Thumb code in applications safe with regard to how the operating system acts. On a theoretical level, having this option disabled could mean that in certain situations a program would not work properly; it does not actively prevent Thumb code from executing.
In practice, the main effect it ever had was with OABI, which used a value embedded in the SWI (now SVC) instruction to identify which system call was being requested.
I think OABI is not even supported by the latest versions of GCC/binutils...
Any 4.7 toolchain is highly likely to be EABI.
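To make the OABI/EABI difference concrete: under EABI the syscall number goes in r7 and the SVC immediate is always zero, so the kernel never decodes the instruction itself; that decoding was the fragile part under Thumb, whose SVC immediate field is only 8 bits wide. A minimal sketch, assuming an EABI ARM Linux target and GCC inline assembly (and ARM-state compilation, since r7 is the Thumb frame pointer):

#include <stddef.h>

/* eabi_write is a hypothetical name for illustration; it invokes the
   write(2) syscall directly via the EABI convention. */
long eabi_write(int fd, const void *buf, size_t len)
{
    register long r0 asm("r0") = fd;
    register long r1 asm("r1") = (long)buf;
    register long r2 asm("r2") = (long)len;
    register long r7 asm("r7") = 4;   /* __NR_write on 32-bit ARM EABI */
    /* Under OABI this would instead be "swi #(0x900000 + 4)", forcing the
       kernel to fetch the immediate back out of the instruction stream. */
    asm volatile("svc #0"
                 : "+r"(r0)
                 : "r"(r1), "r"(r2), "r"(r7)
                 : "memory");
    return r0;   /* result (or negative errno) comes back in r0 */
}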

Related

GNU Arm Embedded Toolchain | arm-none-eabi-gcc options: What is the difference between Thumb (-mthumb) and Arm (-marm) state?

I have a possibly trivial question, but what is the difference between Thumb (-mthumb) and Arm (-marm) state, and why do most tutorials recommend using Thumb state?
I am curious what exactly it means and what it relates to.
Best!
I would suggest reading these two articles: one from Arm, Instruction Set Architecture (-marm means that GCC will generate arm32/A32 code, -mthumb means that it will generate thumb/T32 code), and this research paper, Profile Guided Selection of ARM and Thumb Instructions.
Basically, the two instruction sets differ in which instructions are available as well as in their encoding. You should therefore generally get a smaller, and often faster, executable by using thumb/T32 than by using arm/A32.
This is why most tutorials recommend the thumb/T32 instruction set.
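To see the size difference yourself, compile the same trivial function both ways and compare the object sizes; toy.c is just an illustrative name, and the exact numbers depend on the toolchain:

/* toy.c - a trivial function for comparing A32 and T32 code size. */
int sum(const int *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* arm-none-eabi-gcc -O2 -marm   -c toy.c && size toy.o   -> larger .text (A32)
   arm-none-eabi-gcc -O2 -mthumb -c toy.c && size toy.o   -> smaller .text (T32) */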

How to stop GCC from breaking my NEON intrinsics?

I need to write optimized NEON code for a project, and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON intrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipeline stalls. No matter what I do, GCC works against me and creates slower code full of stalls.
Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?
Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.
float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x = 0; x < n - 15; x += 16)
{
    f32_0 = vld1q_f32(&s[x]);
    f32_1 = vld1q_f32(&s[x+4]);
    f32_2 = vld1q_f32(&s[x+8]);
    f32_3 = vld1q_f32(&s[x+12]);
    __builtin_prefetch(&s[x+64]);
    f32_0 = vnegq_f32(f32_0);
    f32_1 = vnegq_f32(f32_1);
    f32_2 = vnegq_f32(f32_2);
    f32_3 = vnegq_f32(f32_3);
    vst1q_f32(&d[x], f32_0);
    vst1q_f32(&d[x+4], f32_1);
    vst1q_f32(&d[x+8], f32_2);
    vst1q_f32(&d[x+12], f32_3);
}
This is the code it generates:
vld1.32 {d18-d19}, [r5]
vneg.f32 q9,q9 <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8
More info:
GCC 4.9.2 for Raspbian
compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon
When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.
Update: I appreciate James' answer, but in the scheme of things, it doesn't really help with the problem. The simplest of my functions perform a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well-crafted output for NEON intrinsics, while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel's, Microsoft's, or even ARM's compilers.
Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.
To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (see here for details of the specification language used for these, and here for an example pipeline model). This model gives some indication to the GCC scheduling algorithms of the functional units of a processor, and the execution characteristics of instructions on those functional units. GCC can then schedule instructions to minimise structural hazards due to multiple instructions requiring the same processor resources.
Without a -mcpu or -mtune option (to the compiler), or a --with-cpu or --with-tune option (when configuring the compiler), GCC for ARM or AArch64 will try to use a representative model for the architecture revision you are targeting. In this case, -march=armv7-a causes the compiler to schedule instructions as if -mtune=cortex-a8 had been passed on the command line.
So what you are seeing in your output is GCC's attempt at transforming your input into a schedule it expects to execute well when running on a Cortex-A8, and to run reasonably well on processors which implement the ARMv7-A architecture.
To improve on this you can try:
Explicitly setting the processor you are targeting (-mcpu=cortex-a7)
Disabling instruction scheduling entirely (-fno-schedule-insns and -fno-schedule-insns2)
Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.
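For example, the original compile command with each suggestion applied might look like this (neon_copy.c stands in for the actual source file):

gcc -c -fPIE -march=armv7-a -mcpu=cortex-a7 -Wall -O3 -mfloat-abi=hard -mfpu=neon neon_copy.c
gcc -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon -fno-schedule-insns -fno-schedule-insns2 neon_copy.c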
Edit: With regard to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/), just as correctness bugs can be. Naturally, with all optimisations there is some degree of heuristics involved, and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.

Script/Tool predicate for ARM ELF compiled for Thumb OR Arm

I have rootfs and klibc file systems. I am creating make rules, and some developers have an older compiler without interworking.note1 I am trying to verify that all the files get built as ARM-only when a certain version of the compiler is detected. I have re-built the trees several times. I was using readelf -A and looking for Tag_THUMB_ISA_use: Thumb-1, but this seems to appear in ARM-only code (that was built with the interworking compiler) as well as in Thumb code. I can manually run objdump -S and examine the assembly to determine which instruction set is in use.
However, it would be much easier if I had a script/tool predicate so that find, etc. can be used to search through the shadow file systems to look for binaries that may have been missed. I thought that some of this information would be in the ELF header and accessible via objdump or readelf, but I haven't found anything reliable.
Specifically I am looking for,
Compiled 'C' that wouldn't run on a Linux system without CONFIG_ARM_THUMB.
make rules that use 'C' compiler flags that choke non-Thumb compilers.
note1: Interworking allows easy switching between Thumb and ARM modes; the compiler will automatically generate code to support calling from either mode.
The readelf -A output doesn't describe the ELF contents. It just describes the capabilities of the processor and/or system that is expected by, or fed to, the compiler. Since I have an ARM926 CPU, which is an ARMv5TEJ processor, gcc/ld will always set Tag_THUMB_ISA_use: Thumb-1, as that just means ARMv5TEJ is recognized as being Thumb-1 capable. It says nothing about the code itself.
Examining the Linux arch/arm/kernel/elf.c routine elf_check_arch() shows a check for x->e_entry & 1. This leads to the following script,
readelf -h "$1" | grep -q 'Entry.*[13579bdf]$'
I.e., just look at the initial ELF entry value and see if the low bit is set. This is a fast check that fits the spirit of what I am looking for. unixsmurf has a good point that the code inside any ELF can mix and match ARM and Thumb. This may be OK if the program dynamically identifies the CPU and selects an appropriate routine; i.e., just the presence of a Thumb instruction doesn't mean that code will execute.
Just looking at the entry value does determine which gcc compiler flags were used, at least for gcc versions 4.6 to 4.7.
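The same check can be done in C, mirroring the kernel's elf_check_arch() test; a minimal sketch, assuming a 32-bit little-endian ARM ELF input and with error handling kept deliberately terse:

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    Elf32_Ehdr eh;
    FILE *f;
    if (argc < 2 || !(f = fopen(argv[1], "rb")))
        return 2;
    if (fread(&eh, sizeof eh, 1, f) != 1)
        return 2;
    /* On ARM, a Thumb entry point has the low bit of e_entry set. */
    printf("%s: %s entry\n", argv[1], (eh.e_entry & 1) ? "Thumb" : "ARM");
    return 0;
}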
Since Thumb and ARM sequences can be freely interchanged within an object file, even within the same section, plain ELF header inspection is not going to tell you whether a file includes Thumb instructions or not.
A slightly roundabout and still not 100% foolproof way would be to use readelf -r and check whether the output contains R_ARM_THM, indicating a relocation for Thumb.
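That check can be scripted in the same one-liner style as the entry-point test above (and with the same caveat that it is not foolproof):

readelf -r "$1" | grep -q 'R_ARM_THM' && echo "$1: contains Thumb relocations"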

CPU features and compiler symbols

I have a question about how compiler-set symbols, in particular CPU feature flags (like SSE, AES, AVX) are actually set. For instance, if I call gcc with -mavx, is the __AVX__ symbol set regardless of whether the system the code is about to be built on actually supports AVX instructions, or does it check before?
I'm asking because I need to build a particular code path depending on CPU capabilities, and I would like to automate it so that the correct path is determined at compile time based on the build system, instead of manually enabling desired features. But since the only CPU I have supports basically every feature, I cannot test my above assumption (first-world problems, I know).
There is going to be a lot of code so simply keeping everything and branching at runtime is unacceptable - and it is assumed that my library will be built before being used on a given system anyway.
I mean, at worst I can force this behavior by wrapping the gcc arguments in a cpuid-aware script, but if gcc does it automatically it would be preferable. So does anyone know whether it does?
I am mostly interested in gcc's take on this but I am also curious to know how other C compilers behave.
If you pass the -mavx flag, __AVX__ will always be set for the resulting compilation (and the resulting code may not run on non-AVX machines).
If you pass the -march=native flag, gcc will enable the instruction sets supported by the build machine, so __AVX__ will only be set if the build machine supports it.
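As an illustration of the resulting build-time dispatch, here is a minimal sketch; negate is a hypothetical name, and the #else branch is the portable fallback a non-AVX build would compile:

#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
/* AVX path: compiled only when -mavx (or -march=native on an AVX machine) set __AVX__. */
static void negate(float *d, const float *s, size_t n)
{
    size_t i = 0;
    const __m256 sign = _mm256_set1_ps(-0.0f);      /* sign-bit mask */
    for (; i + 8 <= n; i += 8)                      /* 8 floats per AVX register */
        _mm256_storeu_ps(d + i, _mm256_xor_ps(_mm256_loadu_ps(s + i), sign));
    for (; i < n; i++)
        d[i] = -s[i];                               /* scalar tail */
}
#else
/* Portable fallback for builds without AVX enabled. */
static void negate(float *d, const float *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = -s[i];
}
#endif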

Programmatically calling debugger in GCC

Is it possible to programmatically break into the debugger from code compiled with GCC?
For example I want something like:
#define STOP_EXECUTION_HERE ???
which, when put on some line of code, will force the debugger to stop there.
Is it possible at all?
I found one solution, but I can't use it because my embedded ARM system doesn't have signal.h.
(However, I can use inline assembly.)
What you are trying to do is called a software breakpoint.
It is very hard to say precisely without knowing how you actually debug. I assume your embedded system runs a gdbstub. There are various possibilities for how this can be supported:
Use the dedicated BKPT instruction
This could be the standard way for your system and debugger to support software breakpoints.
Feed an invalid instruction to the CPU
The gdbstub could have installed its own UNDEF handler. If you go this route you must be aware of the current CPU mode (ARM or Thumb), because the instruction size will be different. Examples of undefined instructions:
ARM: 0xE7F123F4
Thumb: 0xDE56
At runtime the CPU mode can be found from the lowest bit of the PC register, but the easier way is to know how you compiled the object file where you placed the software breakpoint.
Use the SWI instruction
We did so when using RealView ICE. Most likely this does not apply to you if you run an OS on your embedded system; SWI is usually used by the OS to implement system calls.
Normally, the best way to do this is via a library function supplied with your device's toolchain.
I can't test it out, but a general solution might be to insert an ARM BKPT instruction. It takes an immediate argument, which your debugger might interpret, with potentially odd results.
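Putting that together with the macro from the question, a minimal sketch using GCC inline assembly; bkpt assembles in both ARM and Thumb state, and what happens when no debugger is attached depends on your setup (typically a prefetch abort):

/* Software breakpoint: the immediate is ignored by the core but can be
   read by the debugger. Use only in debug builds. */
#define STOP_EXECUTION_HERE() __asm__ volatile("bkpt #0")

void example(void)
{
    STOP_EXECUTION_HERE();   /* the debugger should stop on this line */
}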
You could run your application from GDB, and in the code call e.g. abort. This will stop your application at that point. I'm not sure if it's possible to continue after that though.
