I was wondering why ARM instructions don't set the CPSR by default (like x86), and why the S bit must be used instead. Do instructions that don't change the CPSR offer better performance? For example, does an ADD instruction offer better performance than ADDS? Or what is the real deal?
It is for performance, or perhaps was. If every instruction changed the flags, you would have a hard time using one flag result across multiple instructions without a branch, and branches mess with your pipeline.
if(a==0)
{
b=b+1;
c=0;
}
else
{
b=0;
c=c+1;
}
Traditionally you have to implement that literally with branches (pseudocode, not real asm):
cmp a,0
bne notzero
add b,b,1
mov c,0
b waszero
notzero:
mov b,0
add c,c,1
waszero:
So you suffer a branch no matter what.
But with conditional execution:
cmp a,0
addeq b,b,1
moveq c,0
addne c,c,1
movne b,0
No branches; you simply rip through the code. The only way this can work is if 1) each instruction has an option to execute conditionally based on the flags, and 2) instructions that would modify the flags have an option not to modify them.
Depending on the processor family/architecture, the add and maybe even the mov will modify the flags, so you have to have both the conditional execution AND the option not to set flags. That is why ARM has both an ADDS and an ADD.
I think they got rid of all that with the 64-bit architecture, so perhaps, as interesting and cool as it was, it wasn't used enough, or wasn't worth it, or they simply needed those four bits to keep all/some instructions at 32 bits.
I was wondering why ARM instructions don't set the CPSR by default (like x86), and why the S bit must be used instead.
It is a choice, and it depends on context. The extra flexibility is only limited by a programmer's imagination.
Do instructions that don't change the CPSR offer better performance? For example, does an ADD instruction offer better performance than ADDS?
Most likely never (see Note 1). I.e., an instruction that doesn't set the CPSR does not execute faster (fewer clocks) for the majority of ARM CPUs and instructions.
Or what is the real deal?
Consider some 'C' code,
int i, sum;
char *p = array;   /* passed in */
for (i = 0, sum = 0; i < 10; i++)
    sum += p[i];
return sum;
This can translate to,
mov  r2, r0          ; get "array" (p) into R2
mov  r1, #10         ; counter (counting down)
mov  r0, #0          ; sum = 0
1:
subs r1, r1, #1      ; decrement counter and set the condition flags
ldrb r3, [r2], #1    ; load byte, post-increment; does not affect flags
add  r0, r0, r3      ; accumulate; does not affect flags
bne  1b              ; tests the flags set by SUBS
bx   lr
In this case, the loop body is simple. However, if there are no conditionals within the loop, then a compiler (or assembly programmer) may schedule the loop decrement wherever they like and still have the conditions set for testing much later. This can be more important with more complex logic, and where the CPU may stall due to data dependencies. It can also be important with conditional execution.
The optional 'S' is more a feature of many instructions than a single instruction.
Note 1: Someone can always make an ARM CPU that does this; you would have to look at the data sheets. I don't know of any CPU that takes more time to set conditions.
Related
My C compiler (GCC) is producing this code which I don't think is optimal
8000500: 2b1c cmp r3, #28
8000502: bfd4 ite le
8000504: f841 3c70 strle.w r3, [r1, #-112]
8000508: f841 0c70 strgt.w r0, [r1, #-112]
It seems to me that the compiler could happily omit the ITE LE instruction, as the two stores following it use the LE and GT conditions from the CMP instruction, so only one will actually be performed. The ITE instruction means that only one of the STRs will be tested and performed, so the time should be equal, but it uses an extra word of instruction memory.
Any opinions on this?
In Thumb mode, the instruction opcodes (other than branch instructions) don't have any space for conditional execution. In Thumb1, this meant that one simply had to use branches to skip instructions if necessary.
In Thumb2 mode, the IT instruction was added, which adds the conditional execution capability, without embedding it into the instruction opcodes themselves. In your case, the le condition part of the strle.w instruction is not embedded in the opcode f841 3c70, but is actually inferred from the preceding ite le instruction by the disassembler. If you use a hex editor to change the ite le instruction to something else, the strle.w and strgt.w will both suddenly disassemble into plain str.w.
See the other linked answer, https://stackoverflow.com/a/26001101, for more details.
The unified assembler syntax, which supports A32 and T32 targets, has added some confusion here. What is being shown in the disassembly is more verbose than what is encoded in the opcodes.
Your ITE instruction is very much a Thumb instruction-set placeholder: it defines an IT block which spans the following two instructions (and, being Thumb, those two instructions are not individually conditional). From a micro-architecture/timing point of view, only one instruction needs to be executed (but you shouldn't assume that this folding always takes place).
The strle/strgt syntax could be used on its own for an A32 target, where the IT block is not necessary since the instruction set has a dedicated condition-code field.
In order to write (or disassemble) code which can be used by both A32 and T32 assemblers, what you have here is both approaches to conditional execution written together. This has the advantage that the same assembly routine can be more portable (even if the resulting code is not identical - optimisations in the target cpu will also be different).
With T32, the combination of an it and a single 16-bit instruction matches the instruction density of the equivalent A32 instruction; if more than one conditional instruction can be combined, there is an overall win.
I couldn't find any tutorial on how to set the carry flag on ARM to 1 or to 0. Can anybody help me out with that?
As with putting the ARM CPU in different modes and Linux's irqflags.h, setting the mode bits, the IRQ bits and the carry flag can all be done in the same way.
Generally, the macros look like this:
.macro set_cflag, temp_reg
mrs \temp_reg, cpsr
orr \temp_reg, \temp_reg, #(1<<29)   @ bit 29 of the CPSR is the C flag
msr cpsr_f, \temp_reg
.endm

.macro clear_cflag, temp_reg
mrs \temp_reg, cpsr
bic \temp_reg, \temp_reg, #(1<<29)
msr cpsr_f, \temp_reg
.endm
It is three steps:
Read the old value.
Update the flag in a working register.
Write the value back.
An additional consideration is 'atomic' behaviour; i.e., you might need to disable interrupts, memory faults, etc. For user code or simple 'polling mode' bare metal, the above is fine.
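For completeness, here is a rough C equivalent of the macros above using GCC extended inline asm. This is only a sketch under the same assumptions (A32 state, NZCV writable via cpsr_f), and the function names are mine rather than anything from a library:

#include <stdint.h>

/* Same read-modify-write pattern as the assembly macros above. */
static inline void set_cflag(void)
{
    uint32_t tmp;
    __asm__ volatile(
        "mrs %0, cpsr\n\t"
        "orr %0, %0, #(1 << 29)\n\t"   /* bit 29 of the CPSR is the C flag */
        "msr cpsr_f, %0"
        : "=r"(tmp) : : "cc");
}

static inline void clear_cflag(void)
{
    uint32_t tmp;
    __asm__ volatile(
        "mrs %0, cpsr\n\t"
        "bic %0, %0, #(1 << 29)\n\t"
        "msr cpsr_f, %0"
        : "=r"(tmp) : : "cc");
}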
If you really want to be 'efficient', then, looking at your surrounding context and known register values, you can execute some instruction that you know will set or clear the carry flag. For instance, if R0 is 0 then adds r0,r0,r0 will clear the carry flag. An instruction like eors r0,r0,r0 will not touch the carry bit. It may also depend on whether you need to preserve the other N, Z and V bits. The cpsr_f notation will only alter the NZCV bits. You can use msr cpsr_f, #imm if you want to set/clear all of them at once, i.e. when you don't care about their old values (the immediate is subject to the usual encoding restrictions). The other fields, like mode, IRQ, etc., remain untouched.
See also:
Heyrick on status register.
ARM Q bit
I have a small program that I'm trying to use to identify the CPU frequency programmatically.
My program is structured as follows:
Set an alarm
Increment register in a while(1) loop
Compute speed upon SIGALRM
Initially, I was using
register unsigned int cycles asm("r6");
...
while(1)
cycles++;
Upon using objdump, I noticed that this actually translated to the following:
9aa0: e1a03006 mov r3, r6
9aa4: e2833001 add r3, r3, #1
9aa8: e1a06003 mov r6, r3
9aac: eafffffb b 9aa0 <estimate_from_cycles+0x1cc>
Since I wasn't sure why this translated to 3 instructions, I tried using inline assembly instead:
register unsigned int cycles asm("r6");
...
while(1)
asm("add r6, r6, #1);
This translated to:
9aa0: e2866001 add r6, r6, #1
9aa4: eafffffd b 9aa0 <estimate_from_cycles+0x1cc>
Why did the previous implementation translate to 3 instructions?
On the ARM platform, the b <label> instruction takes 3 cycles; subtract operations, however, take only 1 cycle.
Is there any way I can subtract from PC register?
Is subtract even allowed on PC?
Is there any other way to reduce the number of cycles required to implement the same logic?
edit: I'm using CodeSourcery's arm-none-linux-gnueabi- toolchain with no optimizations
The implementation translated to 3 instructions very likely because you didn't enable any optimisations.
However, from a quick test, it looks like you'll have to write inline assembly anyway because when I compile the following using -O3 -fomit-frame-pointer
void test(void) {
register unsigned int cycles asm("r6");
while(1) cycles++;
}
the routine simply got optimised down to
00000000 <test>:
0: eafffffe b 0 <test>
Even adding volatile is useless, since the compiler knows for sure that writing to a CPU register will definitely not have any side effects (unlike memory), so it's fair for it to be optimised out.
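If you do want the busy loop to survive optimisation, one workaround (my own sketch, not from the original post) is to make the increment an asm volatile statement; the compiler has to keep it, though you would still need to publish the count somewhere (the register binding, or a volatile global) for the SIGALRM handler to read:

static void spin_forever(void)
{
    unsigned int n = 0;
    for (;;)
        /* opaque to the optimiser: the increment cannot be deleted or hoisted */
        __asm__ volatile("add %0, %0, #1" : "+r"(n));
}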
Answering your other questions,
Is there any way I can subtract from PC register?
Is subtract even allowed on PC?
Yes, certainly. But I'm not sure if that still costs one cycle.
Is there any other way to reduce the number of cycles required to implement the same logic?
As a side note, your logic won't give very accurate results because your process may be switched out between when you start and finish measuring.
You're expecting:
< your process >
|<---------------your alarm duration----------------->|
When, it's probably more like (where | is a context switch):
<your process> | <other processes ...> | <your process>
|<---------------your alarm duration----------------->|
I'm trying to optimize some functions and I realized that I know next to nothing about how long certain things take.
I can ask all the questions here, but I'd rather just find a good article on the subject if anyone knows one.
I'm using IAR to write a program in C for an ATMEL SAM7S processor. I have a sort function that takes 500uS or so, and I wanted to see if I could speed it up. I could also just post it here but I was hoping to learn for myself.
Like, is it any faster to subtract two 16 bit integers than it is to subtract two 32 bit integers? And how long does an operation like that take? Just one cycle or more? How long does multiplication take compared to subtraction?
Anyone know a place to look? I tried googling for some stuff but I couldn't come up with any useful search terms.
If anyone has any ideas on my specific function, I can post details. I'm basically trying to match two analog values to the closest index in a table of calibrated values. Right now I iterate through the whole table and use least squares to determine the closest match. It's pretty straightforward, and I'm not sure there is a faster way without applying some extra logic to my table. But if I at least knew how long certain things took, I could probably optimize it myself.
is it any faster to subtract two 16 bit integers than it is to subtract two 32 bit integers?
Not on an ARM architecture which has native 32-bit registers, no.
Anyone know a place to look?
The canonical place to look for instruction cycle timings is the Technical Reference Manual for the particular core your chip implements, e.g. the ARM7TDMI; the timings for the simple ALU ops are there, and yes, it is one cycle. It is not a friendly document to read if you're not already well familiar with the instruction set, though...
Right now I iterate through the whole table
You'll be much better off looking at algorithmic optimisations here (eg indexing the table, sorting by one co-ordinate to narrow it down, etc) than worrying about instruction-level micro-optimisations.
A good first step would be to study the assembly language of the architecture you are coding for.
After that you should be able to read the binary generated by your compiler, and finally compare what the computer really has to do under two different implementations.
You can use the timers in your SAM7S: read a timer at the start, read it again after N searches, and subtract to get the difference. Try different algorithms and see what you see. A sketch of the pattern follows.
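A minimal sketch of that pattern (my own; TIMER_COUNT is a placeholder for whatever free-running counter register you configure, so take the real address and setup from the SAM7S datasheet):

#include <stdint.h>

/* Placeholder: point this at a configured, free-running timer counter. */
#define TIMER_COUNT (*(volatile uint32_t *)0xFFFA0010u)

uint32_t time_searches(void (*do_search)(void), int n)
{
    uint32_t start = TIMER_COUNT;      /* read the timer at the start */
    for (int i = 0; i < n; i++)
        do_search();                   /* run the algorithm N times */
    return TIMER_COUNT - start;        /* elapsed timer ticks for N searches */
}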
As far as 16-bit math vs 32-bit math, yes there can be a huge difference, but you have to look at your architecture. A subtract between two registers takes the same one clock whether the values are 16-bit or 32-bit. But coming from C code the variables may eventually land in memory, and you have to know whether you have a 16-bit or 32-bit data bus (yes, ARM7s can have a 16-bit bus; look at the Game Boy Advance, where Thumb code runs significantly faster than ARM code). It takes twice as many cycles to read or write 32-bit numbers on a 16-bit bus. You likely do NOT have a 16-bit bus, though.
Using 16-bit variables on a 32-bit processor forces the processor to add extra instructions to strip or extend the upper bits so that the math is correct for a 16-bit variable. Those extra instructions can hurt performance; a simple subtract that might have been, say, 3 or 4 instructions worst case might now be 5 or 6, and that is noticeable in a tight loop. Generally you want variables that match the processor's register size: on a 32-bit ARM, use 32-bit variables as much as possible, even if you are only counting to 10. A sketch of the difference follows below.
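A small illustration of that last point (hypothetical functions, not from the original answer): with a 16-bit loop counter the compiler may have to insert extra instructions to keep the value wrapped to 16 bits, while the 32-bit version maps directly onto the native register width.

#include <stdint.h>

uint32_t sum_with_u16_counter(const uint8_t *p)
{
    uint16_t i;                 /* narrower than the register: may need masking */
    uint32_t sum = 0;
    for (i = 0; i < 1000; i++)
        sum += p[i];
    return sum;
}

uint32_t sum_with_u32_counter(const uint8_t *p)
{
    uint32_t i;                 /* matches the native 32-bit register size */
    uint32_t sum = 0;
    for (i = 0; i < 1000; i++)
        sum += p[i];
    return sum;
}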
Hopefully I am understanding the problem you are trying to solve here; if not, let me know and I will edit/remove this response.
Depending on how many bits are in your measurement, the typical solution for what you are doing is a look-up table. So that I can show an example, let's say you are taking a 4-bit measurement that you want to calibrate; call it 0 to 15. Calibration of the sensor generated a list of data points, let's say:
raw cal
0x03 16
0x08 31
0x0E 49
I assume what you are doing at runtime is something like this: if the sensor reads 0x05, you look through the list for an entry the reading matches, or for the two cal points it falls between.
Searching, you will find it to be between 0x03 and 0x08; to get the calibrated result from the raw 0x05 measurement:
cal = (((0x05-0x03)/(0x08-0x03))*(31-16))+16 = 22
You have a divide in there, which is a HUGE performance killer on most processors, ARM7 in particular as it doesn't have a hardware divide. The multiply is not as bad, but you want to avoid those like the plague as well. And think about how many instructions all of that takes.
Instead, what you do is take the algorithm you use at runtime and, in an ad-hoc program run ahead of time, generate all the possible outputs for all the possible inputs (a small generator sketch follows the table below):
0 7
1 10
2 13
3 16
4 19
5 22
6 25
7 28
8 31
9 34
10 37
11 40
12 43
13 46
14 49
15 52
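For example, the ad-hoc generator might look something like this sketch (host-side C, my own names; it reproduces the table above from the three cal points in the example, extrapolating off the ends with the nearest segment):

#include <stdio.h>

struct point { int raw; int cal; };
static const struct point pts[] = { {0x03, 16}, {0x08, 31}, {0x0E, 49} };
#define NPTS ((int)(sizeof pts / sizeof pts[0]))

static int interpolate(int raw)
{
    int i = 1;
    while (i < NPTS - 1 && raw > pts[i].raw)     /* pick the segment to use */
        i++;
    double t = (double)(raw - pts[i - 1].raw) / (pts[i].raw - pts[i - 1].raw);
    return (int)(t * (pts[i].cal - pts[i - 1].cal) + pts[i - 1].cal + 0.5);
}

int main(void)
{
    for (int raw = 0; raw <= 15; raw++)          /* every possible 4-bit input */
        printf("%d %d\n", raw, interpolate(raw));
    return 0;
}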
Now turn that into a table in your run-time code:
unsigned char cal_table[16]={7,10,13,16,19,22,25,28,31,34,37,40,43,46,49,52};
and then runtime
cal = cal_table[raw&15];
The code to implement this looks something like:
ldr r3, =cal_table
and r0, r0, #15
ldrb r0, [r3, r0]
takes like 5 clocks to execute.
Just the math to find cal from raw after you have searched through the table:
cal= (((raw-xlo)/(xhi-xlo))*(yhi-ylo)+ylo);
looks something like this:
docal:
stmfd sp!, {r3, r4, r5, lr}
ldr r3, .L2
ldr r5, .L2+4
ldr lr, .L2+8
ldr ip, [r5, #0]
ldr r0, [r3, #0]
ldr r1, [lr, #0]
ldr r2, .L2+12
rsb r0, ip, r0
rsb r1, ip, r1
ldr r5, [r2, #0]
bl __aeabi_uidiv
ldr r4, .L2+16
ldr r3, .L2+20
ldr r4, [r4, #0]
rsb r5, r4, r5
mla r4, r0, r5, r4
str r4, [r3, #0]
ldmfd sp!, {r3, r4, r5, pc}
And the divide function is as bad if not worse. The look up table should make your code run dozens of times faster.
The problem with look-up tables is that you trade memory for performance, so you have to have a table big enough to cover all the possible inputs. A 12-bit sensor would give you as many as 4096 entries in the look-up table, for example. If you knew, say, that the measurement would never be below 0x100, you could make the table 0x1000 - 0x100 = 3840 entries deep and subtract 0x100 from the raw value before looking it up, trading a couple of instructions at run time to save a few hundred bytes of memory. A sketch follows below.
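That trick looks something like this (a sketch with made-up names; RAW_MIN encodes the assumption that readings below 0x100 never occur):

#include <stdint.h>

#define RAW_MIN    0x100u                  /* assumed lowest possible reading */
#define TABLE_LEN  (0x1000u - RAW_MIN)     /* 3840 entries instead of 4096 */

static uint16_t cal_table[TABLE_LEN];      /* filled in by your generator */

static uint16_t calibrate(uint16_t raw)
{
    return cal_table[raw - RAW_MIN];       /* one extra subtract at run time */
}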
If the table would still be too big, you can try some other tricks, like making a look-up table of the upper bits whose output is a pre-computed offset into the cal table at which to start your search. So if you had a 12-bit ADC but didn't have room for a 4096-entry look-up table, you could make a 16-entry look-up table, take the upper 4 bits of the ADC output and use that to index the table. The table would contain the entry in the cal table at which to start searching. Say your cal table had these entries:
....
entry 27 raw = 0x598 cal = 1005
entry 28 raw = 0x634 cal = 1600
entry 29 raw = 0x6AB cal = 1800
entry 30 raw = 0x777 cal = 2000
Your 16-entry look-up table would then have these entries:
...
[6] = 27;
[7] = 29;
...
And how you would use it is
start = lut[raw>>8];
for(i=start;i<cal_tab_len;i++)
{
...
}
instead of
for(i=0;i<cal_tab_len;i++)
{
}
This can greatly shorten the time it takes to find the entry in the table on which you then perform the math.
For the particular problem of taking a raw value and turning it into a calibrated value at runtime, there are many, many similar shortcuts. I don't know of one book that covers them all. Which path to take has a lot to do with your processor, your memory system and availability, and the size and nature of your data.
You generally want to avoid divides in particular, and sometimes multiplies, if your processor does not support them in hardware (i.e., in very few clock cycles). Most processors do not. (Yes, the one or two processors most programmers target do have a single-cycle multiply and divide.) Even on processors that have a single-cycle multiply and divide, the operation often has to be wrapped with a C library call that decides whether it is safe to perform it with the hardware instruction or whether it has to be synthesized with a library routine.
I mentioned above that for most variables you want to match the native register size of the processor. If you have fixed-point multiplies or divides, you will often want to use half the register size of the processor. On a 32-bit processor, unless you take the time to examine the instructions in detail, you probably want to limit your multiplies to 16-bit inputs with a 32-bit output, and your divides to 32-bit inputs with a 16-bit output, and hope the optimizer helps you out.
Again, if I have assumed incorrectly what problem you were trying to solve, please comment and I will edit/modify this response.
I'm reading an ARM manual and came across this suggestion, but the reason is not given.
Why are unsigned types faster?
Prior to ARMv4, ARM had no native support for loading halfwords or signed bytes. To load a signed byte you had to LDRB and then sign-extend the value (LSL it up, then ASR it back down). This is painful, so char is unsigned by default.
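Purely as an illustration of that cost (my own function names, nothing from the manual): the unsigned load is just the byte fetch, while the signed load needs an extra extension step; the XOR/subtract below is a portable stand-in for the LSL/ASR pair.

#include <stdint.h>

static inline uint32_t load_u8(const uint8_t *p)
{
    return *p;                              /* a single LDRB */
}

static inline int32_t load_s8(const uint8_t *p)
{
    uint32_t raw = *p;                      /* LDRB: zero-extended, 0..255 */
    return (int32_t)(raw ^ 0x80u) - 0x80;   /* then sign-extend bit 7 */
}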
In ARMv4 instructions were added to handle halfwords and signed values. These new instructions had to be squeezed into the available instruction space. Limits on the space available meant that they could not be made as flexible as the original instructions, which are able to do various address computations when loading the value.
So you may find that LDRSB, for example, is unable to combine a fetch from memory with an address computation whereas LDRB could. This can cost cycles. Sometimes we can rework short-heavy code to operate on pairs of ints to avoid this.
There's more info on my site here: http://www.davespace.co.uk/arm/efficient-c-for-arm/memaccess.html
I think it's just that the instruction set for ARM CPUs is optimized for unsigned types. Some operations can be done with one instruction for unsigned types but need multiple instructions if the type is signed. I think that's why, when compiling for ARM, most (all?) C and C++ compilers default to unsigned char rather than the more usual signed char.
The only advantages of unsigned types I can think of are that division and modulo implementations may be slightly faster, and you can do tests like if (unsigned_value < limit) rather than if (signed_value >= 0 && signed_value < limit).
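The second point is the familiar single-comparison bounds check; a tiny example (illustrative names only):

#include <stdbool.h>

static bool in_range(int value, unsigned int limit)
{
    /* casting to unsigned folds the "value < 0" case into "too large" */
    return (unsigned int)value < limit;
}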
I suspect your manual may be out of date. Any ARM in use today will have v4 or later of the instruction set, and I'm pretty sure that no instructions are faster or slower depending on signedness.
On older ARMs, I believe that signed multiplication could be slower; I think that early termination only looked for all zeros in the top bits, not all ones, so multiplications involving negative numbers would always take the maximum time (although this depended on the value, not on whether the type was signed or unsigned). On at least ARMv4 and later, early termination works for negative values too.
Also, I think very early ARMs couldn't load a single byte, only a word. So you'd need two instructions to load an unsigned byte, and three to load a signed one:
ldr r0, [r1]
and r0, r0, #0xff
versus
ldr r0, [r1]
mov r0, r0, asl #24
mov r0, r0, asr #24 ; but this could maybe be combined with later instructions
versus (these days) ldrb r0, [r1] or ldrsb r0, [r1] to do a single-byte load.
On a modern processor, it's very unlikely that using unsigned types will have a measurable impact on performance. Use whichever type makes most sense, then look at the code in detail once you've identified any performance bottlenecks.