Difference between Thumb2 and ARM when an interrupt occurs

I am porting a project to the Freescale TWR-K60F120M development board and a Kinetis K60 32-bit ARM® Cortex™-M4 MCU. While manipulating assembly code, I came across a function that saves a task's context in specific registers.
Does anyone know in which registers the task context is saved when an interrupt occurs for Thumb2 (the Cortex™-M4 instruction set)?
Thanks.

The ARM architectural reference documents are quite clear on how this works. You need to refer to the documents for the specific core you are using for details, in case there are differences. The Cortex-M and non-Cortex-M cores are definitely quite different. The non-Cortex-M cores (Cortex-A, ARM11, etc.) have pseudocode in the documentation for each handler, and I believe they switch to ARM mode. The only processors with both ARM mode and Thumb2 are the most recent Cortex-As. So if you are asking what the difference is between a Cortex-M and a non-Cortex-M: again, that is well documented in the ARM docs, but:
The Cortex-M is designed so that you do not need assembly-language wrappers (or compiler-specific directives that generate that additional assembly) to protect the GPRs and return with the right instruction. The Cortex-M does that in hardware and is designed so that the address of a C function can go right in the interrupt vector table. The non-Cortex-Ms generally don't support Thumb2, but whether in Thumb mode or ARM mode I believe they switch to ARM mode for the handler, from which you can switch back, of course. On a non-Cortex-M you have separate stacks and banked registers, so depending on the interrupt and your handler you may need to preserve more registers, and you certainly cannot simply return with bx lr; you have to use the proper return instruction for the exception.
Also, the Cortex-M uses a list of addresses in the vector table, where a traditional ARM uses a list of instructions (usually you need a branch, b, or an ldr pc to get out of the table in one instruction).
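To answer the concrete question: per the ARMv7-M architecture, on exception entry the hardware itself stacks r0-r3, r12, lr, pc and xPSR, and the handler starts with a special EXC_RETURN value in lr; everything else is callee-saved under the normal C calling convention. That is what lets a plain C function serve as a handler. A minimal sketch, assuming GCC and invented handler/section names:

/* hypothetical IRQ handler name; any plain C function works */
void UART0_IRQHandler(void)
{
    /* r0-r3, r12, lr, pc, xPSR were stacked by hardware before entry;
       the compiler preserves r4-r11 like in any other C function */
    /* ...clear the interrupt source in the peripheral, do the work... */
}

/* the Cortex-M vector table is a list of addresses, so the C function
   goes straight into the table (linker-script section name assumed) */
__attribute__((section(".isr_vector")))
void (* const vectors[])(void) = { /* ..., */ UART0_IRQHandler /* , ... */ };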

Related

How to detect FPU in Cortex M?

Cortex-M processors implement the CPUID register, through which it is possible to detect information about the core: part number (e.g. Cortex-M7 or Cortex-M4), revision and patch level (e.g. r1p2), etc.
Is there a register, or some other way, to detect whether an FPU has been implemented by the implementer? And how can the type of FPU be detected (VFPv4, VFPv5-SP or VFPv5-DP)?
In the Cortex-M Architecture Reference Manual:
B3.2.20 Coprocessor Access Control Register, CPACR
The CPACR characteristics are:
Purpose: Specifies the access privileges for coprocessors
Usage constraints: If a coprocessor is not implemented, a write of 0b01 or 0b11 to the corresponding CPACR field reads back as 0b00.
Configurations: Always implemented
The VFP will have implemented CP10 and CP11 (decimal). If there is no VFP, then they should read back as 0b00. This applies to the majority of Cortex-M CPUs. As a vendor can implement their own IP, it is possible that some CPU/SoC might not work as documented. It would be prudent to trap/handle the undefined instruction exception which will be taken if a coprocessor is not present.
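A minimal probe sketch based on that CPACR behaviour, assuming a Cortex-M with CPACR at its architectural address 0xE000ED88 (CP10 occupies bits [21:20] and CP11 bits [23:22]):

#include <stdint.h>

#define CPACR (*(volatile uint32_t *)0xE000ED88u)  /* SCB->CPACR */

/* returns nonzero if writes to the CP10/CP11 fields stick, i.e. an FPU
   coprocessor interface is implemented */
static int fpu_present(void)
{
    uint32_t saved = CPACR;
    CPACR = saved | (0xFu << 20);     /* request full access to CP10/CP11 */
    __asm volatile ("dsb\n\tisb");    /* let the write take effect */
    int present = ((CPACR >> 20) & 0xFu) == 0xFu;
    CPACR = saved;                    /* restore the original access rights */
    __asm volatile ("dsb\n\tisb");
    return present;
}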

ARM/Thumb interworking confusion regarding Thumb-2

I've been going through ARM ISA-related documentation for a while, and so far I believe I've got a good understanding of the basics of ARM/Thumb interworking. I'll quickly summarize that in the following:
Instructions can be either 4-byte aligned (ARM) or 2-byte aligned (Thumb).
Thumb and ARM instructions reside in separate regions, i.e. they are not intermixed without an explicit processor state change.
A state change can happen upon executing bx, blx, or an ldm/ldr that loads the PC. The choice between ARM and Thumb depends on the value of the least significant bit of the target address, which can be 0 or 1 respectively.
The current state of the processor can be either ARM or Thumb. That depends on bit 5 of the CPSR (the T bit).
Rules for state change can be summarized in the following figure taken from this paper:
However, Thumb-2 instructions have confused me a bit. For instance, let's inspect the encoding of the ADC instruction, which can be found in section A8.8.2 of the ARMv7-A/R reference manual. Basically, the same instruction has three distinct encodings: 16-bit (Thumb), 32-bit (Thumb-2), and 32-bit (ARM).
Here are my questions:
Do the 32-bit Thumb-2 instructions execute in the ARM or the Thumb mode of the processor? (I'm assuming it's the latter, but I'm not sure.)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in Thumb-2. Does that mean an explicit state change using bx, blx, ldm or ldr doesn't need to happen?
Final note: this is the closest question to mine; however, I'm focusing on interworking.
Choosing a mode
so far I believe I've got a good understanding of the basics of ARM/Thumb interworking.
Well, that is useful, but it is really part of an older story. Originally there were only the 32-bit ARM instructions (1980s to mid-1990s). Then ARM made a mode that was like a compression front end, expanding strictly 16-bit opcodes to 32 bits; this was Thumb mode (mid-1990s to ~2005). Then ARM came out with Thumb2 (which is somewhat nebulous), mainly typified by a mix of both 16-bit and 32-bit instructions (~2005 to current).
The concept of interworking is only useful for a CPU with both Thumb (old) and ARM functions. If you have a Thumb2 CPU and a good compiler with normal memory (one or more wait states), then Thumb2 is almost always the best choice.
Thumb2 intermixing
On a Thumb2-capable processor, you do not need interworking! I.e., you don't change modes. You can use the 16-bit Thumb encodings, and if you ask for a mnemonic where that is not possible, the assembler emits a 32-bit version. The Cortex-M CPUs only have a Thumb2 mode (really Thumb mode with instruction extensions).
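A quick way to see this in practice (a sketch; the file, function names and exact encodings chosen are illustrative, check the objdump output on your toolchain):

/* enc.c -- build:   arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -Os -c enc.c
 *          inspect: arm-none-eabi-objdump -d enc.o
 * the listing shows 2-byte and 4-byte opcodes freely intermixed in one
 * function, with no mode switch anywhere */
int mask_inc(int a, int b)   { return (a & b) + 1; }        /* likely 16-bit forms */
long long widen(int a, int b) { return (long long)a * b; }  /* smull: 32-bit only  */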
Disassembling
There are not really three types of opcodes, but two with one extension:
Original 32-bit ARM opcodes.
16-bit-only Thumb encodings.
The Thumb2 extension, with all Thumb opcodes plus more.
As the Thumb opcodes are denser, it is not possible to encode all types of operations, so the Thumb ADC is limited compared to the ARM one. However, for most instructions ARM Holdings updated Thumb2 (the only mode in the CPU is Thumb; Thumb2 is extra instructions/opcodes) to have all the capabilities of the ARM-mode ADC.
There are discussions elsewhere on recognizing the mode in a binary. Assuming the code is not trying to obfuscate and people made rational choices, you will only see two types of disassembly:
ARM 32-bit
Thumb2
A Thumb2 disassembler should work with pure Thumb code. Most people do not use interworking; if they do, a large part of the binary will be Thumb mode, with a small performance-critical section in ARM mode.
A difficulty with Thumb2 is that the mixed 16/32-bit stream can lead a disassembler to misinterpret the instruction stream if it starts decoding in the middle of a 32-bit encoding.
Final note, this is the closest question to mine, however, I'm focusing on interworking.
Interworking makes no sense on a Thumb2 CPU. Since your question is tagged disassembling, I tried to answer with that focus, versus the other answers, which are mainly about what the modes are. For ELF disassembly, the disassembler should have no trouble locating the major function entry points and should be able to disassemble without much issue.
Do the 32-bit Thumb-2 instructions execute in the ARM or the Thumb mode of the processor?
Thumb-2 instructions are accessible as were Thumb instructions when the processor is in Thumb state, that is, the T bit in the CPSR is 1 and the J bit in the CPSR is 0. (source)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in Thumb-2. Does that mean an explicit state change using bx, blx, ldm or ldr doesn't need to happen?
No state change needs to happen, since Thumb-2 instructions and ordinary Thumb instructions execute in the same state. As for how this fits with the instruction encoding, the ARM Architecture Reference Manual : Thumb-2 Supplement says this:
The new 32-bit Thumb instructions are added in the space previously occupied by the Thumb BL and BLX instructions. This is made possible by treating the BL and BLX instructions as 32-bit instructions, instead of treating them as two 16-bit instructions.
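As a small, concrete illustration of the LSB convention summarized in the question (a sketch; some_function is an arbitrary symbol): on a Thumb-only Cortex-M, every code address carries bit 0 set, even though the instructions themselves sit at even addresses, because that bit is what bx/blx inspect on a state-changing branch.

#include <stdint.h>
#include <assert.h>

extern void some_function(void);

void check_thumb_bit(void)
{
    uintptr_t p = (uintptr_t)&some_function;
    assert((p & 1u) == 1u);   /* Thumb state marker in the pointer */
}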

Where to find I bit and how to edit it to enable interrupts in ARM Cortex-M4

On an ARM Cortex-M4F MCU (the TM4C1294NCPDT specifically), one of the steps to get interrupts (GPIO interrupts) working is to clear the I bit.
I searched a lot, but I couldn't find any useful information about it. Could anybody please tell me where exactly to find that bit and how to modify it, and whether I need any special procedure?
And it would be great to be told where exactly that information can be found, after the explanation, please (so I can learn how to answer such questions myself).
The CMSIS provides a standard cross-vendor software interface to Cortex-M based devices. It defines a number of functions for interacting with the NVIC and PRIMASK, including the intrinsics __disable_irq()/__enable_irq().
The ARM Cortex-M interrupt system is quite complicated and very well thought out. It consists of CPU registers and a tightly coupled interrupt controller (NVIC). Interrupts are prioritized and vectored. There is no single interrupt-enable flag as on smaller 8/16-bit MCUs.
For each interrupt, there are two gates in the ARM core between the event and the CPU. The first is the CPU PRIMASK register (a single bit), which can be seen as most similar to the classical interrupt-enable flag. The second is one enable bit per interrupt in the NVIC. For these there is an ARM standard in the CMSIS headers, which provide the functions __enable_irq() and __disable_irq() for the PRIMASK bit. The peripheral interrupt itself has to be enabled with NVIC_EnableIRQ(IRQn_Type IRQn), where IRQn is the interrupt number as defined in the MCU-specific header file.
Finally, there are usually also interrupt-enable bits in each peripheral module, as known from the smaller MCUs.
Note that for an interrupt to pass through, all gates have to be open (all bits set to "enable"). Use the CMSIS functions to manipulate the bits; they very likely will not take more instructions than a hand-crafted version.
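A sketch of the whole chain in CMSIS terms (the device header name and GPIOA_IRQn are illustrative; check your MCU's header for the real names ending in _IRQn):

#include "TM4C1294NCPDT.h"            /* hypothetical device header name */

void gpio_irq_setup(void)
{
    __disable_irq();                  /* gate 1: set PRIMASK while configuring */
    /* ... gate 3: enable the interrupt in the GPIO peripheral itself ... */
    NVIC_SetPriority(GPIOA_IRQn, 3);  /* interrupts are prioritized */
    NVIC_EnableIRQ(GPIOA_IRQn);       /* gate 2: per-interrupt enable in the NVIC */
    __enable_irq();                   /* clear PRIMASK: interrupts may fire now */
}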
Edit:
There is no actual need to fiddle with assembler or the registers yourself. Just use the CMSIS functions; you can very likely not do better yourself, but possibly worse. That is exactly the intention of CMSIS.
(end edit)
Start reading the reference manual for the MCU and the vendor's homepage; that should provide references and app notes for the device. You should also read the technical reference manual and the architecture reference manual from ARM. Actually, just have a close look at all related documents there for your CPU (the M4 in your case). These are free; some require registering.
For the NVIC, you should not access it directly, but via the CMSIS header files as provided by TI for exactly this MCU (the headers require some device-specific settings). If they are not available, you can get them from ARM, but then you have to provide the device-specific settings yourself (they are few and are given in the MCU's reference manual).
As the ARM Cortex-M4 has multiple interrupts, you need their symbolic names to enable/disable them. These are defined in the MCU header which also defines all the peripheral modules (there might be multiple such headers). The names end with _IRQn; just search for that.
To use the Cortex-M4, you should read the documents mentioned, or you can try a good book. However, as this is no tutorial site, nor is it allowed to recommend books, please search yourself.
OK, the easiest answer to my question is:
Use the "CPSID I" or "CPSIE I" inline assembly instructions, which set or clear the PRIMASK (I) bit respectively. (Of course, that works only in privileged mode.)
These two assembly instructions are equivalent to the __disable_irq() and __enable_irq() CMSIS functions, respectively.
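For completeness, the GCC inline-assembly forms (a sketch; this is essentially what the CMSIS intrinsics expand to on GCC, and it requires privileged mode):

static inline void irq_disable(void) { __asm volatile ("cpsid i" ::: "memory"); }
static inline void irq_enable(void)  { __asm volatile ("cpsie i" ::: "memory"); }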

ARM Thumb/Thumb-2 performance

I am working on an ARM Cortex-M3 controller which has the Thumb-2 instruction set.
Thumb mode is used to compress instructions to a 16-bit size, so code size is reduced. But why is it said that performance is reduced with plain Thumb mode?
In case of Thumb-2, it is said performance is improved as per these two links:
Wikipedia
Arm.com
Improve performance in cases where a single 16-bit instruction restricts functions available to the compiler.
A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
What exactly is this performance? Can someone give a few examples related to it?
When compared against the ARM 32-bit instruction set, the Thumb 16-bit instruction set (not talking about Thumb2 extensions yet) takes less space because the instructions are half the size, but there is a performance drop, in general, because it takes more instructions to do the same thing as in ARM. There are fewer features in the instruction set, and most instructions only operate on registers r0-r7. Comparing apples to apples, more instructions to do the same thing is slower.
Now the Thumb2 extensions take formerly undefined Thumb instructions and use them to create 32-bit Thumb instructions. Understand that there is more than one set of Thumb2 extensions: ARMv6-M adds a couple dozen perhaps; ARMv7-M adds something like 150 instructions to the Thumb instruction set; I don't know what ARMv8 or the future holds. So, assuming ARMv7-M, they have bridged the gap between what you can do in Thumb and what you can do in ARM. Thumb2 is a reduced ARM instruction set as Thumb is, but not as reduced; it might still take more instructions to do the same thing in Thumb2 (meaning Thumb plus the Thumb2 extensions) compared to ARM.
This gives a taste of the issue: a single instruction in ARM and its equivalent in Thumb.
ARM
and r8,r9,r10
THUMB
push {r0,r1}   @ free two low registers as scratch
mov r0,r8      @ copy the operands down: thumb ALU ops only reach r0-r7
mov r1,r9
and r0,r1      @ two-operand form only: r0 = r0 AND r1
mov r1,r10
and r0,r1      @ r0 = (r8 AND r9) AND r10
mov r8,r0      @ move the result back up
pop {r0,r1}    @ restore the borrowed registers
Now, a compiler wouldn't do that; the compiler would know it is targeting Thumb and do things differently by choosing other registers. You still have fewer registers and fewer features per instruction:
mov r0,r1
and r0,r2
It still takes two instructions/execution cycles to AND two registers together without modifying the operands and put the result in a third register. Thumb2 has a three-register AND, so you are back to a single instruction using the Thumb2 extensions. And that Thumb2 instruction allows r0-r15 for any of those three registers, where Thumb is limited to r0-r7.
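For comparison, the Thumb2 version of the original example (a sketch in the same style; in unified syntax this assembles to a single 32-bit thumb2 encoding):

THUMB2
and r8,r9,r10  @ one 32-bit opcode, three operands, high registers allowed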
Look at the ARMv5 Architectural Reference Manual; under each Thumb instruction it shows you the equivalent ARM instruction. Then go to that ARM instruction and compare what you can do with it that you can't do with the Thumb instruction. It is a one-way path: the Thumb instructions (not Thumb2) have a one-to-one relationship with ARM instructions. All Thumb instructions have an equivalent ARM instruction, but not all ARM instructions have an equivalent Thumb instruction. You should be able to see from this exercise the limitation on the compilers when using the Thumb instruction set.
Then get the ARMv7-M Architectural Reference Manual and look at the instruction set. Compare the "all Thumb variants" encodings (the ones that include ARMv4T) with the ones that are limited to ARMv6 and/or v7, and see the expansion of features between Thumb and Thumb2, as well as the Thumb2-only instructions that have no Thumb counterpart. This should clarify what the compilers have to work with between Thumb and Thumb2. You can then go so far as to compare Thumb+Thumb2 with the full-blown ARM instruction set (the ARMv7-A/R manual), and see that Thumb2 gets a lot closer to ARM. But you lose, for example, conditionals on every instruction, so conditional execution in Thumb becomes comparisons with branches over code, where in ARM you can sometimes have an if-then-else without branching.
Thumb-2 introduced variable-length instructions to the original Thumb; instructions can now be a mixture of 16-bit and 32-bit. That means you retain the size advantage of the original Thumb in everyday code, but gain access to almost the full ARM feature set in more complex code, without the ARM-interworking overhead previously incurred by Thumb.
Aside from the aforementioned access to the full register set from all register operations, Thumb-2 added back branchless conditional execution in the form of the IF-THEN (IT) block. The original Thumb removed the trademark ARM feature of conditional execution on nearly all instructions; in Thumb-2 this is achieved by prepending an IT instruction carrying conditions for up to four succeeding instructions.
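A sketch of an IT block in GCC inline assembly (the function name is invented; the ite prefix conditionalizes the two following instructions without any branch):

int sign01(int x)   /* returns 1 if x >= 0, else 0, branchlessly */
{
    int r;
    __asm volatile (
        "cmp   %1, #0\n\t"
        "ite   ge\n\t"        /* if-then-else on the GE condition */
        "movge %0, #1\n\t"    /* 'then' arm: executed when x >= 0 */
        "movlt %0, #0"        /* 'else' arm: executed when x <  0 */
        : "=r" (r) : "r" (x) : "cc");
    return r;
}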
In addition, the instruction set itself has been vastly expanded; for example, the Cortex-M4F implements the DSP extension as well as the FPv4-SP floating point extension. In fact, I believe even NEON can be encoded in Thumb2.
ARM 32-bit
ARM is a 32-bit instruction set; all opcodes are 32 bits. The leading bits denote conditional execution, which is generally wasteful, as 90+% of code executes unconditionally. ARM mode supports 16 nearly symmetric registers (with some special cases for PC, LR and SP).
Most instructions can take an 's' suffix to set the condition codes.
Thumb 16-bit
The original Thumb has 16-bit-only opcodes. It does not support conditional execution, and access is mainly restricted to the lower eight registers. All arithmetic instructions set the condition codes. Some instructions can retrieve data from the higher registers. It can be looked at as a compression engine on the instruction decode.
For some algorithms and memory topologies, Thumb can be faster than ARM. However, this is fairly rare and needs slow (non-zero wait state) instruction memory to be the case.
As a practical example, some Game Boy Advance code would mainly execute in Thumb mode, but would jump to zero-wait-state RAM and transition to ARM mode for a performance-critical routine.
Thumb2 mixed mode
Thumb2 extended the Thumb ISA to allow both 16-bit and 32-bit opcodes. Almost the entire functionality of the original ARM instruction set can be achieved with Thumb2. Since the instruction stream is denser, it is higher performance than the original ARM in almost every case, due to lower instruction-fetch overhead.
Thumb2 allows conditional execution of up to four instructions with 'if/else' opcode conditions. It allows use of all 16 registers, and with the unified assembler syntax the same code can be written to produce either ARM 32-bit or mixed Thumb2 output.
Unified code will almost always be faster when Thumb2 is selected. There are fairly rare ARM sequences that cannot be encoded directly in Thumb2, and those few snippets could be faster as ARM. But generally, for any large code base, Thumb2 is faster.
This mode can be confusing with loop unrolling and jump tables, something an x86 programmer would naturally think of: there are '.n'/narrow/16-bit and '.w'/wide/32-bit encodings of identical instructions, so if you treat code as an 'array' of tasks, the computations can be more complex. You also have the possibility of transferring control into the middle of an instruction.
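A sketch of the narrow/wide suffixes in unified syntax (inline assembly; the two adds are semantically identical, only the encoding size differs):

void narrow_wide_demo(void)
{
    __asm volatile (
        "adds.n r0, r0, #1\n\t"   /* force the 16-bit (narrow) encoding */
        "adds.w r0, r0, #1"       /* force the 32-bit (wide) encoding   */
        ::: "r0", "cc");
}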
As an example of ARM code that cannot be encoded in Thumb2:
movlo r0,#1
moveq r0,#0
movhi r0,#-1
The above is only possible in ARM mode. However, such sequences are very rare and would only matter if you are porting assembler code from ARM to Thumb2. If you are selecting a compiler mode, Thumb2 should always produce better code (faster and smaller).
Summary
Each mode has variations in the available opcodes depending on the CPU model. However, the general concepts of each mode and their performance are as stated.

Want to configure a particular peripheral register in ARM9 based chip

I have a Verilog-based verification environment for an ARM-based chip, and I have to write new tests in C++ to verify a peripheral. I have all the ARM-based GCC tools in place, but I do not know how to make a particular peripheral register visible in a C++-based test. I want to write to this register, wait for the interrupt from the peripheral, and then read back the status of another peripheral register.
I would like to know how this can be done, and which documentation from ARM I should refer to. I tried, and found that all the documentation is aimed at system developers; I need the basic information.
Regards
Manish
You will eventually, if not immediately, want the ARM ARM (yes, ARM twice: once for ARM, the second for Architectural Reference Manual; you can google it and find it as a free download). Second, you want the TRM, the Technical Reference Manual for the specific core in your chip. ARM doesn't make chips; they make processor cores that other people put in their chips, so the company that makes the chip may or may not have included the TRM in their documentation. If you have an ARM core in Verilog, then I assume you purchased it, and that means you have the specific TRM for the specific core you purchased, plus any add-ons (like a cache, for example).
You can take this with a grain of salt, but I have done what you are doing for many years now (testing in simulation and later on the real chip), and my preference is to write my C code as if it were going to run embedded on the ARM. Well, in this case, perhaps you are running embedded on the ARM.
Instead of something like this:
#define SOMEREG (*(volatile unsigned int *)0x12345678)
and then in your code
SOMEREG = 0xabc;
or
somevariable = SOMEREG;
somevariable |= 0x10;
SOMEREG = somevariable;
My C code uses external functions.
extern unsigned int GET32 ( unsigned int address );
extern void PUT32 ( unsigned int address, unsigned int data);
somevariable = GET32(0x12345678);
somevariable|=0x10;
PUT32(0x12345678,somevariable);
When running on the chip in or out of simulation:
.globl PUT32
PUT32:
str r1,[r0]
bx lr ;# or mov pc,lr depending on architecture
.globl PUT16
PUT16:
strh r1,[r0]
bx lr
.globl GET32
GET32:
ldr r0,[r0] ;# I know what the ARM ARM says, this works
bx lr
.globl GET16
GET16:
ldrh r0,[r0]
bx lr
Say you name the file putget.s:
arm-something-as putget.s -o putget.o
then link putget.o in with your C/C++ objects.
I have had gcc and pretty much every other compiler fail to get the volatile-pointer approach to work 100% of the time. It usually fails right after you release your code to the manufacturing folks to take your tests and run them on the product, and then you have to stop production and re-write or re-tune a bunch of code to get the compiler unconfused again. The external-function approach has worked 100% of the time on all compilers. The only drawback is the performance when running embedded, but the benefit of abstraction across all interfaces and operating systems pays you back for that.
I assume you are doing one of two things: either you are running code on the simulated ARM, trying to talk to something tied to the simulated ARM, or this is a peripheral that will eventually be tied to an ARM but for now sits outside the core yet inside the design, meaning it hopefully is memory-mapped and tied to the ARM's memory interface (AMBA, AXI, etc.). Eventually, I assume, the code will be doing the former, so you will have to get into tool and linker issues; there are many examples out there, some of my own as well, and like building a gcc cross-compiler it is trivial once shown the first time.
For the first case you have to overcome the embedded hurdle: you will need to build bootable code, probably rom/flash-based (read-only), as that is likely how the ARM/chip will boot, and deal with the linker scripts that separate rom from ram. Here is my advice on that: eventually, if not now, the hardware engineers will want to simulate the rom timing, which is painfully slow in simulation. So compile your program to run completely from ram (other than the exception table, which is a separate topic), and compile to a binary format that you are willing to write an ad hoc utility for reading; elf is easy, so are ihex and srec, though none as easy as a plain old binary .bin.
What you ultimately want to do is write some assembler that boots up on the virtual prom/flash, enables the instruction cache (if they have that implemented and working in simulation; if not, wait on that step), uses the ldm and stm instructions in a loop to copy as many words at a time as you can to ram, then branches to ram (a sketch of that loop follows below). I have a host-based utility that takes the .bin file and creates an assembler program containing both the copy-to-ram assembler and the binary itself as .word directives, then assembles and links that program into a format the simulation can use.
Do not let the hardware engineers convince you that you have to re-build the Verilog every time; you can use $readmemh() or some other such thing in Verilog to read a file at runtime and avoid re-compiling the Verilog to change the ARM binary. You will want to write an ad hoc host-based utility to convert your .bin or .elf or whatever to a file that the Verilog can read; readmemh is trivial.
So, coming back from that tangent: use the put/get functions to talk to registers. You have to use the TRM and the ARM ARM to place the interrupt handler code somewhere, and you have to enable the interrupt, most likely in more than one place in the ARM as well as in the peripheral. The beauty of simulating is that you can watch your code execute, see the interrupt leave the peripheral, and debug your code based on what you see. With a real chip you don't know whether your code failed to create the interrupt, failed to enable the interrupt, or the interrupt worked but you made a mistake in the handler; with a Verilog simulator you can see all of this, and you should strive to learn to read the waveforms and not rely on the hardware engineers to do it for you. Modelsim or Cadence or whoever can save the waveforms in .vcd format, and you can use a free tool named gtkwave to view them. Don't let them convince you that they don't have any more licenses available for you to look at stuff.
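A sketch of that copy loop (GNU assembler, ARM mode, invented register assignments: r0 = source in rom, r1 = destination in ram, r3 = end of the source image, r4 = ram entry point):

copy_loop:
    ldmia r0!,{r2,r5-r11}   @ read eight words from rom, post-increment r0
    stmia r1!,{r2,r5-r11}   @ write eight words to ram, post-increment r1
    cmp   r0,r3             @ copied past the end of the image yet?
    blo   copy_loop
    bx    r4                @ branch to the code now sitting in ram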
All of that is secondary. If this is an off-core but on-chip peripheral, then you probably want to test that logic without the ARM core first. If you don't know Verilog: it's easy, you can just look at the code and figure it out; software engineers can pick it up in a few days or a week if already experienced in languages, particularly C.
Either way, the hardware engineer likely has a test bench for the peripheral. Create, or have them create, a test bench with a register interface similar to what you will see once connected to the ARM, either directly on the ARM bus or on a test-bench interface that simplifies the ARM bus. Then use VPI, which is ugly but works (google "foreign language interface" as well as VPI), to connect C code on the host machine running the simulation. Do most of your work in C and Verilog, minimizing the VPI nightmare. Because the VPI code is compiled and linked into the simulation, in a sense, you do not want to have to re-build the sim every time you change your test program; so use sockets or some other IPC interface so that your test code stays separate from the VPI code.
Then write some host code that implements put32 and get32 (put8, put16, whatever functions you want to implement). Now take your test program, which could run on the ARM if compiled that way, and instead compile it on the host, linking it against the put/get/whatever abstraction layer (a sketch of the host side follows below). Now you can write programs that for the moment run on the host but interact with the peripheral in simulation as if it were real hardware, and as if your host programs were embedded programs on the ARM. The interrupt is likely trivial in this environment, since all you have to do is look for it in the waveforms, or have the VPI code print something on the console when the signal changes state, or something like that.
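A sketch of the host side of that abstraction (the socket wire format here is entirely invented; the point is that the same test program links against either this file or the embedded putget.s):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

extern int sim_fd;   /* socket to the VPI bridge, opened elsewhere */

void PUT32(uint32_t addr, uint32_t data)   /* invented format: 'W' addr data */
{
    uint8_t msg[9] = { 'W' };
    memcpy(msg + 1, &addr, 4);
    memcpy(msg + 5, &data, 4);
    write(sim_fd, msg, sizeof msg);
}

uint32_t GET32(uint32_t addr)              /* invented format: 'R' addr -> data */
{
    uint8_t msg[5] = { 'R' };
    uint32_t data;
    memcpy(msg + 1, &addr, 4);
    write(sim_fd, msg, sizeof msg);
    read(sim_fd, &data, 4);
    return data;
}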
Oh, and the reason for copying from rom to ram and then running from ram is that, on average, your sim times will be significantly shorter: fives and tens of minutes instead of hours. Simulating the peripheral by itself, without the ARM, using a foreign language interface to bridge to/from the host, cuts your sim time from minutes to seconds, depending on what you are doing. If you use some sort of abstraction like my put/get, you can write your peripheral code one time in one file and link it different ways: that one file/program/function can be used with the peripheral alone in simulation for quickly developing your code, then run with the ARM in place in simulation, adding the complexity of the ARM exceptions and interrupt system, and later on the real chip, just as it ran on the simulated chip. And later that code can hopefully be used as-is in a driver or application using mmap, etc.
