What is the use of the MSRNE instruction in ARM?

I am not able to find any documentation on this instruction. Is it a macro or an instruction? It is used mainly in context switches, but I am not able to understand its purpose.

This is an MSR instruction, conditionally executed as Not Equal (NE).
MSR is used to move a value from a general-purpose register into a status register (the CPSR or SPSR on 32-bit ARM), which is why it appears in context-switch code, for example to restore the saved program status of the task being resumed. (Writes to system coprocessor registers, such as the CP15 cache invalidation/flushing operations, are done with the separate MCR instruction.)
The NE part makes the instruction execute only when the Zero (Z) status flag is clear, as set by a previous flag-setting operation such as CMP.
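As a minimal sketch of how it might appear in context-switch code (hypothetical registers; the spsr_cxsf suffix selects all SPSR fields):
cmp r0,#0            @ sets the flags; Z is clear when r0 != 0
msrne spsr_cxsf,r1   @ write r1 to the SPSR only if the comparison was Not Equal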

Related

ARM SVE: svld1(mask, ptr) vs svldff1(svptrue<>, ptr)

In ARM SVE there are masked load instructions, svld1(mask, ptr), and there are also first-faulting ("non-failing") loads, svldff1(svptrue<>, ptr).
Questions:
Does it make sense to do svld1 with a mask as opposed to svldff1?
The behaviour of the mask in svldff1 seems confusing. Is there a practical reason to provide a mask other than svptrue for svldff1?
Is there any performance difference between svld1 and svldff1?
Both ldff1 and ld1 can be used to load a vector register. According to my informal tests on an AWS Graviton processor, I find no performance difference, in the sense that both instructions (ldff1 and ld1) seem to have roughly the same performance characteristics. However, ldff1 will read and write the first-fault register (FFR). This implies that you cannot do more than one ldff1 at any one time within an 'FFR group', since they are order sensitive and depend crucially on the FFR.
Furthermore, the ldff1 instruction is meant to be used along with the rdffr instruction, the instruction that generates a mask indicating which loads were successful. Using the rdffr instruction will obviously add some cost. I am assuming that the instruction in question might need to run after ldff1w, thus increasing the latency by at least a cycle. Of course, then you have to do something with the mask that rdffr produces...
Obviously, there is bound to be some small overhead tied to the FFR (clearing, setting, accessing).
"Is there a practical reason to provide a not just svptrue mask for svldff1": The documentation states that the leading inactive elements (up to the fault) are predicated to zero.

How to detect an FPU in Cortex-M?

Cortex-M processors implement the CPUID register, through which it is possible to detect information about the core: part number (e.g. Cortex M7 or M4), revision and patch level (e.g. r1p2), etc.
Is there a register or a way to detect if the FPU has been implemented by the implementer? And how to detect the type of FPU (VFPv4, VFPv5-SP or VFPv5-DP)?
In the Cortex-M (ARMv7-M) Architecture Reference Manual,
B3.2.20 Coprocessor Access Control Register, CPACR
The CPACR characteristics are:
Purpose: Specifies the access privileges for coprocessors
Usage constraints: If a coprocessor is not implemented, a write of 0b01 or 0b11 to the corresponding CPACR field reads back as 0b00.
Configurations: Always implemented
The FPU, if present, will have implemented CP10 and CP11 (decimal). If there is no FPU, those fields should read back as 0b00. This applies to the majority of Cortex-M CPUs. As a vendor can implement their own IP, it is possible that some CPU/SoC might not work as documented, so it would be prudent to trap/handle the UsageFault that is taken if a coprocessor instruction is executed while no coprocessor is present.
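A minimal probe along these lines might look as follows (a sketch for ARMv7-M, where the CPACR is memory-mapped at 0xE000ED88; the no_fpu label is hypothetical):
ldr r0,=0xE000ED88       @ CPACR address in the System Control Space
ldr r1,[r0]
orr r1,r1,#(0xF << 20)   @ request full access to CP10 and CP11
str r1,[r0]
dsb
isb
ldr r1,[r0]
tst r1,#(0xF << 20)      @ reads back as zero if no FPU is implemented
beq no_fpu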

Does the ARM ITE instruction have any use in this scenario?

My C compiler (GCC) is producing this code, which I don't think is optimal:
8000500: 2b1c cmp r3, #28
8000502: bfd4 ite le
8000504: f841 3c70 strle.w r3, [r1, #-112]
8000508: f841 0c70 strgt.w r0, [r1, #-112]
It seems to me that the compiler could happily omit the ITE LE instruction, as the two stores following it use the LE and GT conditions from the CMP instruction, so only one will actually be performed. The ITE instruction means that only one of the STRs will be tested and performed, so the time should be equal, but it is using an extra word of instruction memory.
Any opinions on this?
In Thumb mode, the instruction opcodes (other than branch instructions) don't have any space for conditional execution. In Thumb1, this meant that one simply had to use branches to skip instructions if necessary.
In Thumb2 mode, the IT instruction was added, which adds the conditional execution capability, without embedding it into the instruction opcodes themselves. In your case, the le condition part of the strle.w instruction is not embedded in the opcode f841 3c70, but is actually inferred from the preceding ite le instruction by the disassembler. If you use a hex editor to change the ite le instruction to something else, the strle.w and strgt.w will both suddenly disassemble into plain str.w.
See the other linked answer, https://stackoverflow.com/a/26001101, for more details.
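To illustrate, the same conditional store needs a branch in Thumb1, while Thumb2 can use an IT block (a sketch with arbitrary registers):
@ Thumb1: no conditional stores, so branch around the instruction
cmp r3,#28
bgt 1f
str r3,[r1]
1:
@ Thumb2: the IT instruction makes the following store conditional
cmp r3,#28
it le
strle r3,[r1]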
The unified assembler syntax, which supports A32 and T32 targets, has added some confusion here. What is being shown in the disassembly is more verbose than what is encoded in the opcodes.
Your ITE instruction is very much a thumb instruction set placeholder: it defines an IT block which spans the following two instructions (and, being thumb, those two instructions are not individually conditional). From a micro-architecture/timing point of view, it is only necessary to execute one instruction (but you shouldn't assume that this folding always takes place).
The strle/strgt syntax could be used on its own for an A32 target, where the IT instruction is not necessary since the instruction set has a dedicated condition code field.
In order to write (or disassemble) code which can be used by both A32 and T32 assemblers, what you have here is both approaches to conditional execution written together. This has the advantage that the same assembly routine can be more portable (even if the resulting code is not identical - optimisations in the target cpu will also be different).
With T32, the combination of an it and a single 16-bit instruction matches the instruction density of the equivalent A32 instruction; if more than one conditional instruction can be combined, there is an overall win.
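For example (a sketch, assuming the assembler selects the 16-bit encodings), two conditional A32 instructions take 8 bytes, while the T32 equivalent takes 6:
@ A32: two conditional 32-bit instructions, 8 bytes
addeq r0,r0,#1
subne r0,r0,#1
@ T32: one 16-bit ITE plus two 16-bit instructions, 6 bytes
ite eq
addeq r0,r0,#1
subne r0,r0,#1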

ARM Thumb/Thumb-2 performance

I am working on an ARM Cortex-M3 controller, which has the Thumb-2 instruction set.
Thumb mode is used to compress instructions to 16 bits, so the code size is reduced. But with the original Thumb mode, why is it said that performance is reduced?
In case of Thumb-2, it is said performance is improved as per these two links:
Wikipedia
Arm.com
Improve performance in cases where a single 16-bit instruction restricts functions available to the compiler.
A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
What exactly is this performance? Can someone give a few examples related to it?
When compared against the ARM 32-bit instruction set, the thumb 16-bit instruction set (not talking about thumb2 extensions yet) takes less space because the instructions are half the size, but there is a performance drop, in general, because it takes more instructions to do the same thing as on ARM. There are fewer features in the instruction set, and most instructions only operate on registers r0-r7. Comparing apples to apples, more instructions to do the same thing is slower.
Now the thumb2 extensions take formerly undefined thumb instructions and create 32-bit thumb instructions. Understand that there is more than one set of thumb2 extensions: ARMv6-M adds a couple dozen perhaps, ARMv7-M adds something like 150 instructions to the thumb instruction set; I don't know what ARMv8 or the future holds. So assuming ARMv7-M, they have bridged the gap between what you can do in thumb and what you can do in ARM. Thumb2 is still a reduced ARM instruction set, as thumb is, but not as reduced, so it might still take more instructions to do the same thing in thumb plus thumb2 compared to ARM.
This gives a taste of the issue: a single instruction in ARM and its equivalent in thumb.
ARM
and r8,r9,r10
THUMB
push {r0,r1}
mov r0,r9
mov r1,r10
and r0,r1
mov r8,r0
pop {r0,r1}
Now a compiler wouldn't do that; it would know it is targeting thumb and do things differently by choosing other registers. You still have fewer registers and fewer features per instruction:
mov r0,r1
and r0,r2
It still takes two instructions/execution cycles to AND two registers together, without modifying the operands, and put the result in a third register. Thumb2 has a three-register AND, so you are back to a single instruction using the thumb2 extensions. And that thumb2 instruction allows any of r0-r15 for those three registers, where thumb is limited to r0-r7.
Look at the ARMv5 Architectural Reference Manual; under each thumb instruction it shows you the equivalent ARM instruction. Then go to that ARM instruction and compare what you can do with the ARM instruction that you can't do with the thumb instruction. It is a one-way path: the thumb instructions (not thumb2) have a one-to-one relationship with ARM instructions. All thumb instructions have an equivalent ARM instruction, but not all ARM instructions have an equivalent thumb instruction. You should be able to see from this exercise the limitations on the compilers when using the thumb instruction set. Then get the ARMv7-M Architectural Reference Manual and look at the instruction set: compare the "all thumb variants" encodings (the ones that include ARMv4T) against the ones that are limited to ARMv6 and/or v7, and see the expansion of features between thumb and thumb2, as well as the thumb2-only instructions that have no thumb counterpart. This should clarify what the compilers have to work with between thumb and thumb2. You can then go so far as to compare thumb+thumb2 with the full-blown ARM instruction set (ARMv7-A/R), and see that thumb2 gets a lot closer to ARM. But you lose, for example, conditionals on every instruction, so conditional execution in thumb becomes comparisons with branching over code, where in ARM you can sometimes have an if-then-else without branching...
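As another quick comparison (a sketch; register choices are arbitrary), one conditional ARM instruction with a built-in shift becomes a branch plus two instructions in original thumb:
@ ARM: one instruction, conditional, with a shifted operand
addne r0,r1,r2,lsl #2
@ thumb (not thumb2): branch around a two-instruction sequence
beq 1f
lsl r3,r2,#2
add r0,r1,r3
1: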
Thumb-2 introduced variable length instructions to the original Thumb; now instructions can be a mixture of 16-bit and 32-bit. That means you retain the size advantage of the original Thumb in everyday code, but now have access to almost the full ARM feature-set in more complex code, but without the ARM-interworking overhead previously incurred by Thumb.
Aside from the aforementioned access to the full register set from all register operations, Thumb-2 added back branchless conditional execution in the form of the IF-THEN (IT) block. The original Thumb removed the trademark ARM feature of conditional execution on nearly all instructions; this is now achieved in Thumb-2 by prepending an IT instruction that supplies conditions for up to four succeeding instructions.
In addition, the instruction set itself has been vastly expanded; for example, the Cortex-M4F implements the DSP extension as well as the FPv4-SP floating point extension. In fact, I believe even NEON can be encoded in Thumb2.
ARM 32bit
ARM is a 32-bit instruction set. All opcodes are 32 bits. The leading bits denote conditional execution, which is generally wasteful, as 90+% of code executes unconditionally. ARM mode supports 16 registers nearly symmetrically (with some special cases for PC, LR and SP).
Most instructions can include an 's' suffix to set condition codes.
Thumb 16bit
The original thumb uses 16-bit opcodes only. It does not support conditional execution, and access is mainly restricted to the lower eight registers. All arithmetic instructions set condition codes. Some instructions can retrieve data from the higher registers. It can be looked at as a compression engine on the instruction decode.
For some algorithms and memory topologies, thumb can be faster than ARM. However, this is fairly rare and requires slow (non-zero wait state) instruction memory to be the case.
As a practical example, some 'Game Boy Advance' code would mainly execute in thumb mode, but would jump to zero-wait-state RAM and transition to ARM mode for a performance-critical routine.
Thumb2 mixed mode
Thumb2 extended the thumb ISA but allows for both 16bit and 32bit opcodes. Almost the entire original ARM instruction set functionality can be achieved with Thumb2. Since the instruction stream is more dense, it is higher performance than the original ARM in almost every case due to lower instruction fetch overhead.
Thumb2 allows conditional execution of up to four instructions with 'if/then/else' (IT) opcode conditions. It allows use of all 16 registers, and unified (.syntax unified) code can be written to assemble as either ARM 32-bit or mixed Thumb2 code.
Unified code will always be faster when Thumb2 is selected. There are fairly rare ARM sequences that cannot be encoded directly in Thumb2; those few snippets could be faster in ARM mode. But generally, for any large code base, Thumb2 is faster.
This mode can be confusing with loop unrolling and jump tables, the kind of thing an x86 programmer would naturally think of. That is, there are '.n'/narrow/16-bit and '.w'/wide/32-bit encodings of identical instructions, so if you treat code as an 'array' of tasks, the computations can be more complex. You also open up the possibility of transferring control into the middle of an instruction.
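For example (a sketch), the assembler can be forced to use either encoding of the same operation:
adds.n r0,r0,#1   @ 16-bit (narrow) encoding
adds.w r0,r0,#1   @ 32-bit (wide) encoding of the same instruction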
As an example of ARM code that is not directly encodeable in Thumb2:
movlo r0,#1
moveq r0,#0
movhi r0,#-1
The above is only possible without extra instructions in ARM mode. However, such sequences are very rare and would only matter if you are porting assembler code from ARM to Thumb2. If you are just selecting a compiler mode, Thumb2 should always produce better code (faster and smaller).
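For comparison, the Thumb2 version of the same sequence needs an interleaved IT instruction per condition (a sketch; each it adds a 16-bit opcode):
it lo
movlo r0,#1
it eq
moveq r0,#0
it hi
movhi r0,#-1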
Summary
Each mode has variations on available opcodes depending on CPU model. However, the general concepts of each mode and performance are as stated.

Which are the different variable cycle ARM instructions?

I was reading the book "ARM System Developer's Guide" (Elsevier) and I came across this:
The ARM instruction set differs from the pure RISC definition in several ways that make the ARM instruction set suitable for embedded applications:
Variable cycle execution for certain instructions — Not every ARM instruction executes in a single cycle. For example, load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The transfer can occur on sequential memory addresses, which increases performance since sequential memory accesses are often faster than random accesses. Code density is also improved since multiple register transfers are common operations at the start and end of functions.
Any other ARM instructions you guys can point out which take variable cycles to execute?
Cycle timings are micro-architecture dependent, so you need to check the particular implementation's Technical Reference Manual (TRM). For example, for the Cortex-A9 the timing is described as being quite complicated:
The complexity of the Cortex-A9 processor makes it impossible to calculate precise timing information manually. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and additional events outside the instruction flow.
However, the same document gives precise timings for data-processing, load and store, and multiply instructions, plus some information about branch and serializing instructions.
For example, from the same document you can see that if shifting is involved, a data-processing instruction such as AND may take 1-2 cycles more depending on the shift source, which might be a constant embedded in the instruction or a value read from a register.
Also, beyond the book's note that load-store-multiple timing varies with the number of registers involved, these instructions also vary depending on whether the address is aligned or not.
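A sketch of the usual suspects (ARM assembly; actual cycle counts depend on the micro-architecture, so check the core's TRM):
add r0,r1,r2,lsl #3   @ shift by a constant: typically no extra cost
add r0,r1,r2,lsl r3   @ shift amount from a register: often 1-2 extra cycles
ldmia r4!,{r5-r8}     @ load-multiple: cycle count grows with the register count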
