How to detect FPU in Cortex M? - arm

Cortex-M processors implement the CPUID register, through which it is possible to detect information about the core: part number (e.g. Cortex M7 or M4), revision and patch level (e.g. r1p2), etc.
Is there a register or a way to detect if the FPU has been implemented by the implementer? And how to detect the type of FPU (VFPv4, VFPv5-SP or VFPv5-DP)?
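As a rough illustration of the CPUID fields mentioned above, the sketch below decodes a raw CPUID value into implementer, variant, part number and revision. The struct and function names are made up for this example (they are not CMSIS names). On hardware the value would be read from the CPUID register at 0xE000ED00.

```c
#include <stdint.h>

/* Hypothetical container for the decoded CPUID fields (not a CMSIS type). */
typedef struct {
    uint8_t  implementer; /* bits [31:24], 0x41 = Arm */
    uint8_t  variant;     /* bits [23:20], the N in rNpM */
    uint16_t partno;      /* bits [15:4], e.g. 0xC24 = Cortex-M4, 0xC27 = Cortex-M7 */
    uint8_t  revision;    /* bits [3:0], the M in rNpM */
} cpuid_fields_t;

/* Split a raw CPUID value into its fields. */
cpuid_fields_t cpuid_decode(uint32_t cpuid)
{
    cpuid_fields_t f;
    f.implementer = (uint8_t)(cpuid >> 24);
    f.variant     = (uint8_t)((cpuid >> 20) & 0xFu);
    f.partno      = (uint16_t)((cpuid >> 4) & 0xFFFu);
    f.revision    = (uint8_t)(cpuid & 0xFu);
    return f;
}
```

For example, a Cortex-M7 at r1p2 would decode to part number 0xC27, variant 1 and revision 2.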

In the Cortex-M (ARMv7-M) Architecture Reference Manual:
B3.2.20 Coprocessor Access Control Register, CPACR
The CPACR characteristics are:
Purpose: Specifies the access privileges for coprocessors
Usage constraints: If a coprocessor is not implemented, a write of 0b01 or 0b11 to the corresponding CPACR field reads back as 0b00.
Configurations: Always implemented
The FPU is accessed through coprocessors CP10 and CP11 (decimal). If there is no FPU, the CP10 and CP11 fields of CPACR should read back as 0b00 even after a write of 0b11. This applies to the majority of Cortex-M CPUs. As a vendor can implement their own IP, it is possible that some CPU/SoC might not behave as documented. It would be prudent to also trap/handle the fault that is taken when a floating-point instruction executes with no coprocessor present (on ARMv7-M this is a UsageFault with the NOCP bit set).
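A minimal sketch of the CPACR write/read-back probe described above. The register is passed in as a pointer so the logic can be tried off-target. On hardware the pointer would be (volatile uint32_t *)0xE000ED88, accessed from privileged code. The MVFR0 decode at the end is an addition that the answer does not mention: it assumes the single- and double-precision fields sit in bits [7:4] and [11:8] of MVFR0 (0xE000EF40), and all function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* CP10 is CPACR bits [21:20], CP11 is bits [23:22]; 0b11 means full access. */
#define CPACR_CP10_CP11_FULL (0xFu << 20)

/* True if a CPACR read-back shows that both fields kept the value 0b11. */
bool cpacr_fields_stuck(uint32_t readback)
{
    return (readback & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}

/*
 * Write 0b11 to both fields, read back, then restore the old value.
 * On real hardware, add DSB/ISB barriers after each write to CPACR.
 */
bool fpu_probe(volatile uint32_t *cpacr)
{
    uint32_t saved = *cpacr;
    *cpacr = saved | CPACR_CP10_CP11_FULL;   /* request full access */
    bool present = cpacr_fields_stuck(*cpacr);
    *cpacr = saved;                          /* restore previous setting */
    return present;
}

/* Assumption beyond the answer: classify the FPU from a raw MVFR0 value. */
typedef enum { FPU_NONE = 0, FPU_SINGLE = 1, FPU_SINGLE_DOUBLE = 2 } fpu_kind_t;

fpu_kind_t fpu_kind_from_mvfr0(uint32_t mvfr0)
{
    uint32_t sp = (mvfr0 >> 4) & 0xFu;  /* single-precision field */
    uint32_t dp = (mvfr0 >> 8) & 0xFu;  /* double-precision field */
    if (sp == 2u && dp == 2u) return FPU_SINGLE_DOUBLE;
    if (sp == 2u && dp == 0u) return FPU_SINGLE;
    return FPU_NONE;
}
```

When pointed at an ordinary variable, fpu_probe always reports true, because plain memory keeps every write; only the real register drops the bits when no FPU is present. MVFR0 alone does not separate FPv4-SP from FPv5-SP, so the core part number from CPUID would still be needed for that distinction.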

Related

ARM 32bit instruction advantage over 16bit thumb instructions [duplicate]

I am a bit confused about instruction sets. There are Thumb, ARM and Thumb 2. From what I have read Thumb instructions are all 16-bit but inside the ARMv7M user manual (page vi) there are Thumb 16-bit and Thumb 32-bit instructions mentioned.
Now I have to overcome this confusion. It is said that Thumb 2 supports 16-bit and 32-bit instructions. So is ARMv7M in fact supporting Thumb 2 instructions and not just Thumb?
One more thing. Can I say that Thumb (32-bit) is the same as ARM instructions, which are also 32-bit?
Oh, ARM and their silly naming...
It's a common misconception, but officially there's no such thing as a "Thumb-2 instruction set".
Ignoring ARMv8 (where everything is renamed and AArch64 complicates things), from ARMv4T to ARMv7-A there are two instruction sets: ARM and Thumb. They are both "32-bit" in the sense that they operate on up-to-32-bit-wide data in 32-bit-wide registers with 32-bit addresses. In fact, where they overlap they represent the exact same instructions - it is only the instruction encoding which differs, and the CPU effectively just has two different decode front-ends to its pipeline which it can switch between. For clarity, I shall now deliberately avoid the terms "32-bit" and "16-bit"...
ARM instructions have fixed-width 4-byte encodings which require 4-byte alignment. Thumb instructions have variable-length (2 or 4-byte, now known as "narrow" and "wide") encodings requiring 2-byte alignment - most instructions have 2-byte encodings, but bl and blx have always had 4-byte encodings*. The really confusing bit came in ARMv6T2, which introduced "Thumb-2 Technology". Thumb-2 encompassed not just adding a load more instructions to Thumb (mostly with 4-byte encodings) to bring it almost to parity with ARM, but also extending the execution state to allow for conditional execution of most Thumb instructions, and finally introducing a whole new assembly syntax (UAL, "Unified Assembly Language") which replaced the previous separate ARM and Thumb syntaxes and allowed writing code once and assembling it to either instruction set without modification.
The Cortex-M architectures only implement the Thumb instruction set - ARMv7-M (Cortex-M3/M4/M7) supports most of "Thumb-2 Technology", including conditional execution and encodings for VFP instructions, whereas ARMv6-M (Cortex-M0/M0+) only uses Thumb-2 in the form of a handful of 4-byte system instructions.
Thus, the new 4-byte encodings (and those added later in ARMv7 revisions) are still Thumb instructions - the "Thumb-2" aspect of them is that they can have 4-byte encodings, and that they can (mostly) be conditionally executed via the it instruction (and, I suppose, that their mnemonics are only defined in UAL).
* Before ARMv6T2, it was actually a complicated implementation detail as to whether bl (or blx) was executed as a 4-byte instruction or as a pair of 2-byte instructions. The architectural definition was the latter, but since they could only ever be executed as a pair in sequence there was little to lose (other than the ability to take an interrupt halfway through) by fusing them into a single instruction for performance reasons. ARMv6T2 just redefined things in terms of the fused single-instruction execution
In addition to Notlikethat's answer, and as it hints at, ARMv8 introduces some new terminology to try to reduce the confusion (of course adding even more new terminology):
There is a 32-bit execution state (AArch32) and a 64-bit execution state (AArch64).
The 32-bit execution state supports two different instruction sets: T32 ("Thumb") and A32 ("ARM"). The 64-bit execution state supports only one instruction set - A64.
All A64, like all A32, instructions are 32-bit (4 byte) in size, requiring 4-byte alignment.
Many/most A64 instructions can operate on both 32-bit and 64-bit registers (or arguably 32-bit or 64-bit views of the same underlying 64-bit register).
All ARMv8 processors (like all ARMv7 processors) that implement AArch32 support Thumb-2 instructions in the T32 instruction set.
Not all ARMv8-A processors implement AArch32, and some don't implement AArch64. Some processors support both, but only support AArch32 at lower exception levels.
Thumb: 16 bit instruction set
ARM: 32 bit wide instruction set hence more flexible instructions and less code density
Thumb2 (mixed 16/32 bit): a compromise between ARM and Thumb (16-bit) that mixes the two widths, to get both the performance/flexibility of ARM and the instruction density of Thumb. So a Thumb2 instruction is either a 32-bit-wide instruction offering (a subset of) the ARM functionality or a 16-bit-wide Thumb instruction.
It was confusing for me the Cortex M3 having 4-byte instructions, yet not executing the ARM instructions. Or CPUs capable to have 2-byte and 4-byte opcodes, but capable to execute the ARM instructions too. So I read a book about Arm and now I understand it slightly better. Still, the naming and the overlap are still confusing to me. I was thinking it would be interesting to compare a few CPUs first and then talk about the ISAs.
To compare a few CPUs and what they can do and how they overlap:
Cortex M0/M0+/M1/M23 are considered Thumb (Thumb-1) and execute mostly 2-byte opcodes, which are limited compared to the others. However, some instructions such as mrs, msr, bl, dmb, dsb and isb come from Thumb-2 and are 4-byte. The Cortex M0/M0+/M1 are ARMv6-M, while the Cortex M23 is ARMv8-M. The Thumb-1 instruction set was extended in ARMv7, so it can be said that the ARMv8-M Cortex M23 supports a fuller Thumb-1 (except the it instruction), while the ARMv6-M Cortex M0/M0+ support only a subset of the ISA (they are specifically missing the it, cbz and cbnz instructions). I might be wrong (please correct me if this is not right), but I noticed something funny: the only CPUs I see that support Thumb-1 fully are CPUs that already support Thumb-2 as well; I do not know of a Thumb-1-only CPU that supports 100% of Thumb-1. I think this is because of it, which could be seen as a 2-byte Thumb-2 opcode that was in essence added to Thumb-1. On the Thumb-1 CPUs, a 4-byte opcode could instead be seen as two 2-byte halves that together represent the 4-byte opcode.
Cortex M3/M4/M7/M33/M35P/M55 can execute 2-byte and 4-byte opcodes, both are Thumb-1 and Thumb-2 and support a full set of the ISAs. The 2-byte and 4-byte opcodes are mixed more evenly, while the Cortex M0/M0+/M1/M23 above are biased to use 2-byte opcodes most of the time. Cortex M3/M4/M7 are ARMv7, while Cortex M33/M35P/M55 are ARMv8.
Cortex A/R can accept both ARM and Thumb opcodes and therefore have both 2-byte and 4-byte encodings. To switch between the modes, the branch target address carries the mode in its lowest bit: for example, the branch instruction bx copies bit 0 of the target address into the T bit of the CPSR and so switches the mode (the bit is not kept in the PC, so the instruction fetch itself stays aligned). This works well: for example, when calling a subroutine the return address (and with it the caller's mode) gets saved; inside the subroutine the CPU can be switched to Thumb mode, and on return the saved address restores the PC (and the T bit) and switches back to whatever the caller was (ARM or Thumb mode) without any issue.
ARM7 only supports ARMv3 4-byte ISA
ARM7T supports both Thumb-1 and ARM ISAs (2-byte and 4-byte)
ARM11 (ARMv6, ARMv6T2, ARMv6Z, ARMv6K) supports Thumb-1, Thumb-2 and ARM ISAs
The book I referenced stated that in the ARMv7 and newer the architecture switched from Von Neumann (data and instructions sharing a bus) to Harvard (dedicated busses) to get better performance. However the absolute term "and newer" is not true, because ARMv8 is newer, yet the ARMv8 Cortex M23 is Von Neumann.
The ISAs are:
ARM has 16 registers (R0-R12, SP, LR, PC), only 4-byte opcodes, there are revisions to the ISA, but they are only 4-byte opcodes.
Thumb (aka Thumb-1) splits the 16 registers into lower (R0-R7) and higher (R8-R12, SP, LR, PC); most instructions can access the lower set only, while only some can access the higher set. Only 2-byte opcodes. Low-end devices which have a 16-bit bus (and have to do a 32-bit word access in two steps) perform better when they execute 2-byte opcodes, as these match their bus. The naming confuses me: Thumb can be used as the family term for Thumb-1 together with Thumb-2, or sometimes Thumb is used for Thumb-1 only. I think Thumb-1 is not an official Arm term, just something I have seen people use to make the distinction between the Thumb family of both ISAs and the first Thumb ISA clearer. Instructions in ARM can have the optional s suffix to update the flags in the CPSR register (for example the ands, orrs, movs, adds, subs instructions), while in Thumb-1 the s is always on for these instructions and the flags get updated every time. Some older toolchains do not need the implicit s to be written, but with the move to Unified Assembly Language (UAL) it is now a requirement to specify the s explicitly, even when there is no option to leave it off.
Thumb-2 is an extension to Thumb and can access all registers like ARM does; it has 4-byte opcodes with some differences compared to ARM. In the assembly, the Thumb-1 2-byte narrow opcode and the Thumb-2 4-byte wide opcode can be forced with the .n and .w postfix (example orr.w). The ARM and Thumb-2 opcode formats/encodings are different and their capabilities differ too. Conditional execution of instructions can be used, but only when an it (if-then) instruction/block is prepended. This can be done explicitly or implied (and done by the toolchain behind the user's back).
The confusion might actually be a good sign, as Arm (the company) wanted them to be similar: a lot of effort went into Unified Assembly Language (UAL) so that assembly files made for ARM could be assembled for Thumb-2 without change. If I understand this correctly, that can't be 100% guaranteed and some edge cases could probably be constructed where the ARM assembly can't be assembled as Thumb-2, so this is another absolute statement that is not fully true. For example, the ARM7 bl instruction can address +-32MB while on Cortex M3 it can only reach +-16MB. The situation should still be much better than with Thumb-1, where ARM assembly is more likely to need a rewrite; an ARM to Thumb-2 rewrite is less likely to be needed.
Another difference is in the data processing instructions. Both ARM and Thumb-2 support 8-bit immediates, but ARM can rotate the 8 bits only by an even number of bit positions, while Thumb-2 can rotate by even and odd amounts and on top of that allows repetitive byte patterns such as 0xXYXYXYXY, 0x00XY00XY or 0xXY00XY00. Because the shifts are rotating, left and right shifts can both be achieved by 'overflowing': rotating far enough in one direction is effectively a rotation in the opposite direction, i.e. rotating right by n bits equals rotating left by (32 - n) bits.
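To make the immediate rules concrete, here is a sketch that checks whether a 32-bit constant fits each scheme. It assumes the usual definitions: an ARM immediate is an 8-bit value rotated right by an even amount, and a Thumb-2 modified immediate is either one of the byte patterns above or an 8-bit value with its top bit set rotated right by 8 to 31 bits. The function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* ARM: 8-bit value rotated right by an even amount (0, 2, ..., 30). */
bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by rot with a rotate-left by rot. */
        if (rotl32(v, rot) <= 0xFFu)
            return true;
    }
    return false;
}

/* Thumb-2: byte patterns, or a rotated 8-bit value whose top bit is set. */
bool thumb2_imm_encodable(uint32_t v)
{
    uint32_t lo = v & 0xFFu;          /* candidate XY in bits [7:0]  */
    uint32_t hi = (v >> 8) & 0xFFu;   /* candidate XY in bits [15:8] */

    if (v <= 0xFFu) return true;                                       /* 0x000000XY */
    if (v == (lo | (lo << 16))) return true;                           /* 0x00XY00XY */
    if (v == ((hi << 8) | (hi << 24))) return true;                    /* 0xXY00XY00 */
    if (v == (lo | (lo << 8) | (lo << 16) | (lo << 24))) return true;  /* 0xXYXYXYXY */

    for (unsigned rot = 8; rot < 32; rot++) {
        uint32_t unrotated = rotl32(v, rot);
        if (unrotated >= 0x80u && unrotated <= 0xFFu)
            return true;
    }
    return false;
}
```

With these definitions 0x000001FE (0xFF shifted by an odd amount) and 0x00FF00FF fit only the Thumb-2 scheme, while 0xF000000F (an 8-bit value wrapped around the word boundary) fits only the ARM scheme.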
So in conclusion some Arm CPUs can do:
only 4-byte opcode instructions which are pure ARM ISA
2-byte/4-byte Thumb-1/Thumb-2 ISAs with a focus to use the 2-byte most of the time with only a few 4-byte opcodes, these often are labeled as Thumb (Thumb-1) 2-byte opcode CPUs (and the few 4-byte opcodes are sometimes not mentioned)
2-byte/4-byte Thumb-1/Thumb-2 ISAs and are more evenly mixed between 2-byte and 4-byte opcodes, often labeled as Thumb-2
2-byte/4-byte opcodes by switching between ARM/Thumb modes
Reference for this information: ARM Assembly Language Programming & Architecture Muhammad Ali Mazidi et al 2016. The book was written before the company name change from ARM to Arm, so sometimes it was confusing when it was referencing the company Arm and when the ARM ISA.
Please refer to https://developer.arm.com/documentation/ddi0344/c/programmer-s-model/thumb-2-instruction-set
It explains in detail about the enhancement of the Thumb2 architecture. The same covers the ARM, Thumb and Thumb2 instruction set description implicitly.

FPU version for Cortex-M microcontrollers

From a simple google search, I found out that the fpu version for Tiva C Launchpad is fpv4-sp-d16 but which document tells the fpu version of various microcontrollers(tm4c123gh6pm, stm32f407, stm32f446re, etc.)?
arm-none-eabi-gcc --print-multi-lib
gives the information about architecture and ABI, but the FPU version is not mentioned for a particular architecture.
The FPU is defined by ARM, hence you need to look at the ARM core definitions. Note that FPU is optional for the cores, so you do need to check the silicon vendors' doc on whether they include the FPU or not.
For Cortex-M4, the optional FPU is 32-bits, i.e. single precision FP. Note that this means that double precision (i.e. 64-bit FP) is done without using the FPU.
Cortex-M7 definition includes an optional 64-bit FPU and can execute both single and double precision FP instructions.
Orthogonal to the FPU used is the calling convention that your program uses. As far as FP is concerned, it basically means whether function arguments are passed in FP registers or in normal ARM registers.
The arm community suggested the following answer
"ARM Cortex‑M4 Processor Technical Reference Manual" gives this information
ARM Cortex-M4 TRM
Section 7.1 about fpu says "The Cortex-M4 FPU is an implementation of the single precision variant of the ARMv7-M Floating Point Extension(FPv4-SP)"
Also the 32 single precision registers can be combined into 16 double precision ones (d16) hence fpv4-sp-d16
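Besides reading the TRM, the compiler itself can report what a given set of options (for example GCC's -mfpu and -mfloat-abi) enabled. The sketch below assumes the ACLE predefined macros __ARM_FP (bit 0x4 for single precision, bit 0x8 for double precision) and __ARM_PCS_VFP (defined for the hard-float calling convention). The helper names are invented for the example.

```c
#include <stdbool.h>

/* Decode an __ARM_FP style bitmask: 0x2 half, 0x4 single, 0x8 double. */
bool arm_fp_has_single(unsigned arm_fp) { return (arm_fp & 0x4u) != 0; }
bool arm_fp_has_double(unsigned arm_fp) { return (arm_fp & 0x8u) != 0; }

/* __ARM_FP for this build, or 0 when the compiler does not define it. */
unsigned build_arm_fp(void)
{
#ifdef __ARM_FP
    return (unsigned)__ARM_FP;
#else
    return 0u;
#endif
}

/* True when FP arguments are passed in FP registers (hard-float ABI). */
bool build_uses_hard_float_abi(void)
{
#ifdef __ARM_PCS_VFP
    return true;
#else
    return false;
#endif
}
```

A single-precision-only FPU setting should give arm_fp_has_single(build_arm_fp()) true and arm_fp_has_double(build_arm_fp()) false.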

ARM/Thumb interworking confusion regarding Thumb-2

I've been going through ARM ISA related documentation for a while, and so far I believe that I've got a good understanding of the basics of ARM/Thumb interworking. I'll quickly summarize that in the following:
Instructions can be either 4 byte aligned (ARM) or 2 byte aligned (Thumb).
Thumb and ARM instructions reside in separate regions i.e. they are not intermixed without explicit processor state change.
State change can happen upon executing either of bx, blx, ldm, ldr. Choosing between ARM or Thumb depends on the value of the least significant bit in the address which can be 0 or 1 respectively.
The current state of the processor can be either ARM or thumb. That depends on the state of bit 5 of CPSR.
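The points above can be sketched as a small model of an interworking branch: bit 0 of the target picks the state, and the T bit (bit 5) of the CPSR records it. This is only an illustration with invented function names. It assumes ARM-state targets are already word aligned and treats "Thumb state" as T = 1 with the J bit (bit 24) clear.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_T_BIT (1u << 5)   /* Thumb state bit */
#define CPSR_J_BIT (1u << 24)  /* Jazelle state bit */

/* bx/blx with a register operand: bit 0 of the target selects Thumb. */
bool target_selects_thumb(uint32_t target)
{
    return (target & 1u) != 0;
}

/* Address the core fetches from: bit 0 is a state flag, not an address bit. */
uint32_t fetch_address(uint32_t target)
{
    return target_selects_thumb(target) ? (target & ~1u) : target;
}

/* CPSR after the branch: T follows bit 0 of the target. */
uint32_t cpsr_after_bx(uint32_t cpsr, uint32_t target)
{
    return target_selects_thumb(target) ? (cpsr | CPSR_T_BIT)
                                        : (cpsr & ~CPSR_T_BIT);
}

/* Thumb state means T = 1 and J = 0. */
bool cpsr_is_thumb(uint32_t cpsr)
{
    return (cpsr & CPSR_T_BIT) != 0 && (cpsr & CPSR_J_BIT) == 0;
}
```

For instance, branching to 0x08000101 fetches from 0x08000100 in Thumb state, while branching to 0x08000100 stays in (or returns to) ARM state.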
Rules for state change can be summarized in the following figure taken from this paper:
However, Thumb-2 instructions have confused me a bit. For instance, let's inspect the encoding of instruction ADC which can be found in section A8.8.2 of the ARMv7-A/R reference manual. Basically, the same instruction has 3 distinct encodings 16 bit (Thumb), 32 bit (Thumb2), and 32 bit (ARM).
Here are my questions:
Does the 32-bit Thumb-2 instructions execute in ARM or Thumb mode of the processor? (I'm assuming its the latter but not sure)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in thumb-2. Does that mean explicit state change using bx, blx, ldm or ldr doesn't need to happen?
Final note, this is the closest question to mine, however, I'm focusing on interworking.
Choosing a mode
so far I believe that I've got a good understanding for the basics of ARM/Thumb interworking.
Well, that is useful, but it is really part of an older story. Originally, there were only ARM 32-bit instructions (1980s to mid 1990s). Then ARM made a mode that was like a compression front-end that expanded strictly 16-bit opcodes to 32 bits. This was thumb mode (mid 1990s to ~2005). Then ARM came out with thumb2 (which is somewhat nebulous), mainly typified by a mix of both 16-bit and 32-bit instructions (~2005 to current).
The concept of interworking is only useful for a CPU with thumb (old) and ARM functions. If you have a thumb2 CPU and a good compiler with normal memory (1+ wait states), then the thumb2 is almost always the best choice.
Thumb2 intermixing
In a thumb2 capable processor, you do not need interworking! Ie, you don't change modes. You can use the thumb 16bit encodings and if you ask for a mnemonic where this is not possible, the assembler emits a 32bit version. The Cortex-M CPUs only have a thumb2 mode (really thumb mode with instruction extensions).
Disassembling
There are not really three types of opcodes but two with one extension.
Original 32 bit ARM opcodes.
16 bit only thumb encodings.
the thumb2 extension with all thumb opcodes plus more.
As the thumb opcodes are more dense, it is not possible to do all types of operations. So the thumb ADC is limited compared to the ARM one. However, for most instructions ARM Holdings updated thumb2 (the only mode in the CPU is thumb; thumb2 is extra instructions/opcodes) to have all the capabilities of the ARM mode ADC.
There are discussions on recognizing the mode in a binary elsewhere. Assuming the code is not trying to obfuscate and people made rational choices, you will only have two types of disassembly.
ARM 32 bit
thumb2
A thumb2 disassembler should work with pure thumb code. Most people do not use interworking. If they do, a large part of the binary will be thumb mode, with a small performance critical section in ARM mode.
A difficulty with thumb2 is the mixed 16/32 bit can lead a disassembler to mis-interpret an instruction stream if it decodes a 32bit encoding mid stream.
Final note, this is the closest question to mine, however, I'm focusing on interworking.
Interworking makes no sense on a thumb2 CPU. Since your question is tagged disassembling, I tried to answer with that focus, versus the other question, which is mainly about what the modes are. For elf disassembly, the disassembler should have no trouble locating major function entry points and should be able to disassemble without much issue.
Does the 32-bit Thumb-2 instructions execute in ARM or Thumb mode of the processor?
Thumb-2 instructions are accessible as were Thumb instructions when the processor is in Thumb state, that is, the T bit in the CPSR is 1 and the J bit in the CPSR is 0. (source)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in thumb-2. Does that mean explicit state change using bx, blx, ldm or ldr doesn't need to happen?
No state change needs to happen, since Thumb-2 instructions and ordinary Thumb instructions execute in the same state. As for how this fits with the instruction encoding, the ARM Architecture Reference Manual : Thumb-2 Supplement says this:
The new 32-bit Thumb instructions are added in the space previously occupied by the Thumb BL and BLX
instructions. This is made possible by treating the BL and BLX instructions as 32-bit instructions, instead of
treating them as two 16-bit instructions.
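In practice this means the first halfword tells a decoder how long the instruction is: if its top five bits are 0b11101, 0b11110 or 0b11111, it starts a 32-bit encoding, otherwise it is a complete 16-bit instruction. A minimal sketch, with invented function names and the instruction stream given as an array of halfwords:

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw. */
unsigned thumb_insn_size(uint16_t hw)
{
    unsigned top5 = hw >> 11;
    /* 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit encoding. */
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Count whole instructions in n halfwords, decoding from index start. */
size_t thumb_count_insns(const uint16_t *code, size_t n, size_t start)
{
    size_t count = 0;
    size_t i = start;
    while (i < n) {
        size_t halfwords = thumb_insn_size(code[i]) / 2u;
        if (i + halfwords > n)
            break;              /* truncated 32-bit instruction at the end */
        i += halfwords;
        count++;
    }
    return count;
}
```

Starting the walk in the middle of a 32-bit instruction can misclassify the second halfword as the start of a new instruction, which is how a disassembler loses sync on mixed 16/32-bit code.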

Difference between Thumb2 and ARM when an interruption occurs

I am porting a project to the Freescale TWR-K60F120M development board and a Kinetis K60 32-bit ARM® Cortex™-M4 MCU. While manipulating assembly code, I came across a function that saves a Task context in specific registers.
Does anyone know in which registers the Task context is saved when an interruption occurs for thumb2 ( Cortex™-M4 instruction set) ?
Thanks.
The arm architectural reference documents are quite clear on how this works. You need to refer to the documents for the core you are using for specific details in case there are differences. The cortex-m vs non-cortex-m are definitely quite different. The non-cortex-m (cortex-a, arm11, etc) have pseudo code in the documentation for each handler and I believe that they switch to arm mode. The only processors with arm mode and thumb2 are the most recent cortex-a's. So if you are asking what is the difference between a cortex-m and non-cortex-m, again that is well documented in the arm docs, but:
the cortex-m is designed for not needing to have assembly language wrappers (or compiler specific directives that generate that additional assembly) in order to protect gprs and return with the right instruction. The cortex-m does that in hardware and is designed to be able to have the address of a C function right in the interrupt vector table. The non-cortex-ms generally dont support thumb2, but when in thumb mode or arm mode I believe they switch to arm mode for the handler, which you can switch back of course. You have separate stacks on a non cortex-m and you have banked registers. So depending on the interrupt and your handler you may need to preserve more registers, and you certainly cannot simply return with bx lr; you have to use the proper return instruction based on the exception.
also the cortex-m uses a list of addresses in the vector table, where a traditional arm uses a list of instructions (usually you need to use branch b or ldr pc to get out of the table in one instruction).
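For the Cortex-M side, the registers that the hardware pushes on exception entry can be described with a struct. This is a sketch of the basic 8-word frame (r0-r3, r12, lr, pc, xPSR, lowest address first) as documented for ARMv7-M. The type and function names are invented, and the extended frame used when the FPU context is active is not modelled. Registers r4-r11 are not stacked by hardware, so a task switcher has to save them in software.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic frame pushed by hardware on exception entry, lowest address first. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* LR of the interrupted code */
    uint32_t pc;    /* return address */
    uint32_t xpsr;
} cm_exception_frame_t;

/* Address execution resumes at when the handler returns. */
uint32_t frame_return_address(const cm_exception_frame_t *frame)
{
    return frame->pc;
}

/* Bit 2 of the EXC_RETURN value in LR: 1 = frame is on the process stack. */
bool exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0;
}
```

A handler that needs the interrupted context would first check EXC_RETURN to pick MSP or PSP, then interpret that stack pointer as a pointer to this frame.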

ARM Thumb/Thumb-2 performance

I am working on an ARM Cortex-M3 controller which has the Thumb-2 instruction set.
Thumb mode is used to compress the instruction to a 16-bit size.
So size of code is reduced. But with normal Thumb mode, why is it said that performance is reduced?
In case of Thumb-2, it is said performance is improved as per these two links:
Wikipedia
Arm.com
Improve performance in cases where a single 16-bit instruction restricts functions available to the compiler.
A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
What exactly is this performance? Can someone give a few examples related to it?
When compared against the ARM 32 bit instruction set, the thumb 16 bit instruction set (not talking about thumb2 extensions yet) takes less space because the instructions are half the size, but there is a performance drop, in general, because it takes more instructions to do the same thing as on arm. There are fewer features in the instruction set, and most instructions only operate on registers r0-r7. In an apples to apples comparison, needing more instructions to do the same thing is slower.
Now thumb2 extensions take formerly undefined thumb instructions and create 32 bit thumb instructions. Understand that there is more than one set of thumb2 extensions. ARMv6m adds a couple dozen perhaps. ARMv7m adds something like 150 instructions to the thumb instruction set, I dont know what ARMv8 or the future hold. So assuming ARMv7m, they have bridged the gap between what you can do in thumb and what you can do in ARM. So thumb2 is a reduced ARM instruction set as thumb is, but not as reduced. So it might still take more instructions to do the same thing in thumb2 (assume plus thumb) compared to ARM doing the same thing.
This gives a taste of the issue, a single instruction in arm and its equivalent in thumb.
ARM
and r8,r9,r10
THUMB
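push {r0,r1}    @ free two low registers
mov r0,r9       @ low copy of the first operand
mov r1,r10      @ low copy of the second operand
and r0,r1       @ r0 = r9 & r10
mov r8,r0       @ write the result to the high register
pop {r0,r1}     @ restore the scratch registers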
Now a compiler wouldnt do that, the compiler would know it is targeting thumb and do things differently by choosing other registers. You still have fewer registers and fewer features per instruction:
mov r0,r1
and r0,r2
It still takes two instructions/execution cycles to "and" two registers together, without modifying the operands, and put the result in a third register. Thumb2 has a three-register "and", so you are back to a single instruction using the thumb2 extensions. And that thumb2 instruction allows r0-r15 for any of those three registers, where thumb is limited to r0-r7.
Look at the ARMv5 Architectural Reference Manual, under each thumb instruction it shows you the equivalent ARM instruction. Then go to that ARM instruction and compare what you can do with that arm instruction that you cant do with the thumb instruction. It is a one way path the thumb instructions (not thumb2) have a one to one relationship with an ARM instruction. all thumb instructions have an equivalent arm instruction. but not all arm instructions have an equivalent thumb instruction. You should be able to see from this exercise the limitation on the compilers when using the thumb instruction set. Then get the ARMv7m Architectural Reference Manual and look at the instruction set, and compare the "all thumb variants" encodings (the ones that include ARMv4T) and the ones that are limited to ARMv6 and/or v7 and see the expansion of features between thumb and thumb2 as well as the thumb2 only instructions that have no thumb counterpart. This should clarify what the compilers have to work with between thumb and thumb2. You can then go so far as to compare thumb+thumb2 with the full blown ARM instructions (ARMv7 AR is that what it is called?). And see that thumb2 gets a lot closer to ARM, but you lose for example conditionals on every instruction, so conditional execution in thumb becomes comparisons with branching over code, where in ARM you can sometimes have an if-then-else without branching...
Thumb-2 introduced variable length instructions to the original Thumb; now instructions can be a mixture of 16-bit and 32-bit. That means you retain the size advantage of the original Thumb in everyday code, but now have access to almost the full ARM feature-set in more complex code, but without the ARM-interworking overhead previously incurred by Thumb.
Aside from the aforementioned access to the full register set from all register operations, Thumb-2 added back branchless conditional execution in the form of the IF-THEN (IT) block. The original Thumb removed the trademark ARM feature of conditional execution on nearly all instructions; this is now achieved in Thumb-2 by prepending the IT instruction with conditions for up to four succeeding instructions.
In addition, the instruction set itself has been vastly expanded; for example, the Cortex-M4F implements the DSP extension as well as the FPv4-SP floating point extension. In fact, I believe even NEON can be encoded in Thumb2.
ARM 32bit
ARM is a 32bit instruction set. All opcodes are 32bits. The leading bits denote conditional execution. This is generally wasteful as 90+% of code executes unconditionally. The ARM mode supports 16 registers nearly symmetric (with some special cases for PC, LR and SP).
Most data-processing instructions accept an optional 's' suffix to set the condition codes.
Thumb 16bit
The original thumb is 16bit only opcodes. It does not support conditional execution and access was mainly restricted to the lower eight registers. All arithmetic instructions set condition codes. Some instructions could retrieve data from the higher registers. It can be looked at as a compression engine on the instruction decode.
For some algorithms and memory topology, thumb can be faster than ARM. However it is fairly rare and needs slow (non-zero wait state) instruction memory for this to be the case.
As a practical example, some 'Game boy advance' code would mainly execute in thumb mode, but would jump to zero wait state RAM and transition to ARM mode for a performance critical routine.
Thumb2 mixed mode
Thumb2 extended the thumb ISA but allows for both 16bit and 32bit opcodes. Almost the entire original ARM instruction set functionality can be achieved with Thumb2. Since the instruction stream is more dense, it is higher performance than the original ARM in almost every case due to lower instruction fetch overhead.
Thumb2 allows conditional execution of up to four instructions with the IT ('if-then') opcode and its then/else conditions. It allows use of all 16 registers, and unified-syntax (.syntax unified) code can be written to produce either ARM 32bit or mixed Thumb2 code.
Unified code will always be faster when Thumb2 is selected. There are fairly rare ARM sequences that can not be encoded directly to Thumb2. These few cases snippets could be faster. But generally, for any large code base, Thumb2 is faster.
This mode can be confusing with loop unrolling and jump tables. It is something that an x86 programmer would naturally think of. Ie, there are '.n'/narrow/16bit and '.w'/wide/32bit encodings of identical instructions. So if you treat code as an 'array' of tasks, the computations can be more complex. You also have transfer of control to mid-instruction possibilities.
As an example of 'un-encodeable' Thumb2 ARM code,
movlo r0,#1
moveq r0,#0
movhi r0,#-1
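Above is only possible as three back-to-back instructions in ARM mode: a single Thumb2 IT block covers one condition and its inverse, so the three different conditions (lo, eq, hi) need more than one IT instruction. However, such sequences are very rare and would only matter if you are porting assembler code from ARM to Thumb2. If you are selecting a compiler mode, Thumb2 should always produce better code (faster and smaller).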
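For readers less used to condition codes, this is roughly what the three conditional moves compute, written in C. It assumes they follow a "cmp a, b" that is not shown in the original, with lo and hi being the unsigned lower/higher conditions; the function name is invented.

```c
/* Three-way unsigned comparison matching the movlo/moveq/movhi sequence. */
int three_way_unsigned(unsigned a, unsigned b)
{
    if (a < b)
        return 1;    /* movlo r0,#1  : a is lower than b   */
    if (a == b)
        return 0;    /* moveq r0,#0  : a equals b          */
    return -1;       /* movhi r0,#-1 : a is higher than b  */
}
```

A compiler targeting Thumb2 is free to pick a different instruction sequence for the same function, which is why the encoding limitation mostly matters for hand-written assembly.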
Summary
Each mode has variations on available opcodes depending on CPU model. However, the general concepts of each mode and performance are as stated.
