In ARM assembly, the ADCS instruction adds with the carry flag C and sets the condition flags. But the CMP instruction also sets the condition flags, so the carry produced by ADCS is clobbered before the next iteration. How can I solve this?
This is my code; it implements a BCD adder on r0 and r1:
ldr r8, =#0
ldr r9, =#15
adds r7, r8, #0
ADDLOOP:
and r4, r0, r9
and r5, r1, r9
adcs r6, r4, r5
orr r7, r6, r7
add r8, r8, #1
mov r9, r9, lsl #4
cmp r8, #3
bgt ADDEND
bl ADDLOOP
ADDEND:
mov r0, r7
I tried to save the state of the condition flags, but I don't know how to do it.
To save/restore the Carry flag, you could create a 0/1 integer in a register (perhaps with adc reg, zeroed_reg, #0?), then next iteration cmp reg, #1 or rsbs reg, reg, #1 to set the carry flag from it.
ARM can't materialize C as an integer 0/1 with a single instruction without any setup; compilers normally use movcs r0, #1 / movcc r0, #0 when not in a loop (Godbolt), but in a loop you'd probably want to zero a register once outside the loop instead of using two instructions predicated on carry-set / carry-clear.
Loop without modifying C
Use teq r8, #4 / bne ADDLOOP as the loop branch, like the bottom of a do{}while(r8 != 4).
Or count down from 4 with tst r8,r8 / bne ADDLOOP, using sub r8, #1 instead of add.
TEQ updates N and Z but not C or V flags. (Unless you use a shifted source operand, then it can update C). docs - unlike cmp, it sets flags like eors. The eq / ne conditions work the same: subtraction and XOR both produce zero when the inputs are equal, and non-zero in every other case. But teq doesn't even set C or V flags, and greater / less wouldn't be meaningful anyway.
This is what optimized BigInt code like GMP does, for example in its mpn_add_n function (source) which adds two bigint inputs (arrays of 32-bit chunks).
IDK why you were jumping forwards over a bl (branch-and-link) which sets lr as a return address. Don't do that, structure your asm loops like a do{}while() because it's more efficient, especially when the trip-count is known to be non-zero so you don't have to worry about running the loop zero times in some cases.
There are cbz/cbnz instructions (docs) that jump on a register being zero or non-zero without affecting flags, but they can only jump forwards (out of the loop, past an unconditional branch). They're also only available in Thumb mode, unlike teq which was probably specifically designed to give ARM an efficient way to write BigInt loops.
BCD adding
Your algorithm has bugs; you need a base-10 carry: in packed BCD, 0x05 + 0x06 = 0x11, not 0x0b.
And even the binary Carry flag isn't set by something like 0x0005000 + 0x0007000; there's no carry-out from the high bit, only into the next nibble. Also, adc adds the carry-in at the bottom of the register, not at the nibble your mask isolated.
So maybe you need to do something like subtract 0x000a000 from the sum (for that example shift position), because that will carry-out. (ARM sets C as !borrow on subtraction, so maybe use rsb reverse-subtract or swap the operands.)
NEON should make it possible to unpack to 8-bit elements (mask odd/even and interleave) and do all nibbles in parallel, but carry propagation is a problem; ARM doesn't have an efficient way to branch on SIMD vector conditions (unlike x86 pmovmskb). Just byte-shifting the vector and adding could generate further carries, as with 999999 + 1.
IDK if this can be cut down effectively with the same techniques hardware uses, like carry-select or carry-lookahead, but for 4-bit BCD digits with SIMD elements instead of single bits with hardware full-adders.
It's not worth doing for binary bigint because you can work in 32 or 64-bit chunks with the carry flag to help, but maybe there's something to gain when primitive hardware operations only do 4 bits at a time.
I noticed that armcc generates this kind of code to compare two int64 values:
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
Which can be roughly translated as:
r0 = lower_word_1 ^ lower_word_2
r1 = higher_word_1 ^ higher_word_2
r0 = r1 | r0
jump if r0 is not zero
and something like this when comparing an int64 (in r0,r1) with an integral constant (i.e. an int, in r3):
0x08000674 4058 EORS r0,r0,r3
0x08000676 4308 ORRS r0,r0,r1
0x08000678 D116 BNE 0x080006A8
with the same idea, just skipping the EOR of the higher words altogether, since the higher word only needs to be zero.
but I'm interested - why is it so complicated?
Both cases can be done very straightforwardly by comparing lower and higher words, with a BNE after each:
for two int64, assuming the same registers
CMP lower words
BNE
CMP higher words
BNE
and for int64 with integral constant:
CMP lower words
BNE
CBNZ if higher word is non-zero
This will take the same number of instructions, each of which may (or may not, depending on the registers used) be 2 bytes long.
arm-none-eabi-gcc does something different, but with no playing around with EORS either.
So why does armcc do this? I can't see any real benefit; both versions require the same number of instructions (each of which may be wide or narrow, so no real profit there).
The only slight benefit I can see is less branching, which may be somewhat beneficial for the flash prefetch buffer. But since there is no cache or branch prediction, I'm not really buying it.
So my reasoning is that this pattern is simply legacy from the ARM7 architecture, where CBZ/CBNZ didn't exist and mixing ARM and Thumb instructions was not very easy.
Am I missing something?
P.S. armcc does this at every optimization level, so I presume it is some kind of 'hard-coded' pattern.
UPD: Sure, there is an execution pipeline that is flushed on every taken branch; however, every solution requires at least one conditional branch that may or may not be taken (depending on the integers being compared), so the pipeline will be flushed with equal probability anyway.
So I can't really see the point in minimizing conditional branches.
Moreover, if the lower and higher words were compared explicitly and the integers are not equal, the branch would be taken sooner.
Avoiding branch instructions completely is possible with an IT block, but on Cortex-M3 it can only be up to 4 instructions long, so I'm going to ignore this for generality.
The efficiency of generated code is not measured by the number of machine-code instructions. You need to know the internals of the target machine as well: not only the clocks per instruction, but also how the fetch/decode/execute process works.
Every branch instruction on Cortex-M3 devices flushes the pipeline, which then has to be refilled. If you run from flash memory (which is slow), wait states will slow this process significantly as well. The compiler tries to avoid branches as much as possible.
It can be done your way using other instructions:
int foo(int64_t x, int64_t y)
{
    return x == y;
}
cmp r1, r3
itte eq
cmpeq r0, r2
moveq r0, #1
movne r0, #0
bx lr
Trust your compiler. The people who write them know their trade :). Until you learn more about the ARM Cortex, you can't judge the compiler as simply as you do now.
The code from your example is very well optimized and simple. Keil does a very good job.
As pointed out, the difference is branching vs. not branching. If you can avoid branching, you want to avoid branching.
While the ARM documentation may be interesting, here, as with x86, full-sized ARM, and many other platforms, the surrounding system plays as much of a role as the core. High-performance cores like ARM's are sensitive to the system implementation. These Cortex-M cores are used in microcontrollers, which are quite cost-sensitive: they blow away a PIC, AVR, or MSP430 in MIPS per MHz and MIPS per dollar, but they still have to be cheap. With newer technology, or at higher cost, you are starting to see flash that runs at processor speed across the full range of valid clock speeds (no wait states needed anywhere in the range), but for a long time the flash ran at half the speed of the core even at the slowest core speeds, getting worse as you chose higher core speeds, while SRAM often matched the core. Either way, flash is a major portion of the cost of the part, and how much there is and how fast it is drives the part price to some extent.
Depending on the core (for anything from ARM), the fetch size, and as a result alignment, varies; benchmarks can therefore be skewed or manipulated based on the alignment of a loop-style test and how many fetches are needed (trivial to demonstrate with many Cortex-Ms). The Cortex-Ms generally fetch either a halfword or a full word at a time, and for some cores that is a compile-time option for the chip vendor, so you might have two chips with the same core but different performance. This can be demonstrated too, just not here (unless pushed; I have done that demo too many times on this site already). The branching question, though, we can manage in this test.
I do not have a Cortex-M3 handy; I would have to dig one out and wire it up if need be, but I should not need to, as I have a Cortex-M4 handy, which is also ARMv7-M: a NUCLEO-F411RE.
Test fixture
.thumb_func
.globl HOP
HOP:
bx r2
.balign 0x20
.thumb_func
.globl TEST0
TEST0:
push {r4,r5}
mov r4,#0
mov r5,#0
ldr r2,[r0]
t0:
cmp r4,r5
beq skip
skip:
subs r1,r1,#1
bne t0
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
The SysTick timer generally works just fine for these kinds of tests; no need to mess with the debugger's timer, which often just shows the same thing with more work. It is more than enough here.
Called like this, with the result printed out in hex:
hexstring(TEST0(STK_CVR,0x10000));
hexstring(TEST0(STK_CVR,0x10000));
copy the flash code to ram and execute there
hexstring(HOP(STK_CVR,0x10000,0x20000001));
hexstring(HOP(STK_CVR,0x10000,0x20000001));
Now, the STM32s have this cache thing in front of the flash which affects loop-based benchmarks like these, as well as other benchmarks against these parts; sometimes you cannot get past it and you end up with a bogus benchmark. But not in this case.
To demonstrate fetch effects you want a system delay in fetching; if the fetches are too fast you might not see the effects.
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d1ff bne.n 8000030 <skip>
08000030 <skip>:
00050001 <-- flash time
00050001 <-- flash time
00060004 <-- sram time
00060004 <-- sram time
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d0ff beq.n 8000030 <skip>
08000030 <skip>:
00060001
00060001
00080000
00080000
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: bf00 nop
08000030 <skip>:
00050001
00050001
00060000
00060000
So we can see that, as far as this loop-based test goes, a branch that is not taken costs the same as a nop. So perhaps there is a branch predictor (often a small cache that remembers the last N branches and their destinations and can start the prefetch a clock or two early). I have not dug into it; I did not really need to, as we can already see that there is a performance cost for a branch that has to be taken, making your suggested code not equal: the same number of instructions, but not equal performance.
So the quickest way to remove the loop and avoid the STM32 cache thing is to do something like this in RAM:
push {r4,r5}
mov r4,#0
mov r5,#0
cmp r4,r5
ldr r2,[r0]
instruction under test repeated many times
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
with the instruction under test being a bne to the next instruction, a beq to the next instruction, or a nop:
// 800002e: d1ff bne.n 8000030 <skip>
00002001
// 800002e: d0ff beq.n 8000030 <skip>
00004000
// 800002e: bf00 nop
00001001
I did not have room for 0x10000 instructions, so I used 0x1000; we can see that there is a hit for both branch types, with the taken branch being more costly.
Note that the loop-based benchmark did not show this difference; you have to be careful doing benchmarks or judging results, even the ones I have shown here.
I could spend more time tweaking core or system settings, but based on experience I think this has already demonstrated the desire not to replace eor, orr, bne with cmp, bne, cbnz. Now, to be fair, your other example uses eor.w (Thumb-2 extensions), which burns more clocks than the 16-bit Thumb instructions, so there is another thing to consider (I measured it as well).
Remember, for these high-performance cores you need to be very sensitive to fetching and fetch alignment; it is very easy to make a bad benchmark. Not that an x86 is not high performance, but to make that inefficient core run smoother there is a ton of stuff around it trying to keep the core fed. It is like a semi-truck vs. a sports car: the truck can be efficient once up to speed on the highway, but in city driving, even keeping to the speed limit, a Yugo will get across town faster than the semi (if it does not break down). Fetch effects, unaligned transfers, etc. are difficult to see on an x86 but fairly easy to see on an ARM, so to get the best performance you want to avoid the easy cycle-eaters.
Edit
Note that I jumped to conclusions too early about what GCC produces. I had to work more on crafting an equivalent comparison. I started with:
unsigned long long fun2 ( unsigned long long a)
{
    if(a==0) return(1);
    return(0);
}
unsigned long long fun3 ( unsigned long long a)
{
    if(a!=0) return(1);
    return(0);
}
00000028 <fun2>:
28: 460b mov r3, r1
2a: 2100 movs r1, #0
2c: 4303 orrs r3, r0
2e: bf0c ite eq
30: 2001 moveq r0, #1
32: 4608 movne r0, r1
34: 4770 bx lr
36: bf00 nop
00000038 <fun3>:
38: 460b mov r3, r1
3a: 2100 movs r1, #0
3c: 4303 orrs r3, r0
3e: bf14 ite ne
40: 2001 movne r0, #1
42: 4608 moveq r0, r1
44: 4770 bx lr
46: bf00 nop
This used an IT instruction, which is a natural solution here since the if-then-else cases can each be a single instruction. It is interesting that they chose r1 instead of the immediate #0; I wonder if that is a generic optimization, perhaps due to the complexity of immediates in a fixed-length instruction set, or because the register form takes less space on some architectures. Who knows.
800002e: bf0c ite eq
8000030: bf00 nopeq
8000032: bf00 nopne
00003002
00003002
800002e: bf14 ite ne
8000030: bf00 nopne
8000032: bf00 nopeq
00003002
00003002
Using SRAM: 0x1000 sets of those three instructions executed linearly, so 0x3002 means one clock per instruction on average.
Putting a mov in the IT block doesn't change the performance:
ite eq
moveq r0, #1
movne r0, r1
It is still one clock per instruction.
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a)
{
    for(;a!=0;a--)
    {
        more_fun(5);
    }
    return(0);
}
48: b538 push {r3, r4, r5, lr}
4a: ea50 0301 orrs.w r3, r0, r1
4e: d00a beq.n 66 <fun4+0x1e>
50: 4604 mov r4, r0
52: 460d mov r5, r1
54: 2005 movs r0, #5
56: f7ff fffe bl 0 <more_fun>
5a: 3c01 subs r4, #1
5c: f165 0500 sbc.w r5, r5, #0
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
66: 2000 movs r0, #0
68: 2100 movs r1, #0
6a: bd38 pop {r3, r4, r5, pc}
This is basically the compare with zero
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
Against another
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a, unsigned long long b)
{
    for(;a!=b;a--)
    {
        more_fun(5);
    }
    return(0);
}
00000048 <fun4>:
48: 4299 cmp r1, r3
4a: bf08 it eq
4c: 4290 cmpeq r0, r2
4e: d011 beq.n 74 <fun4+0x2c>
50: b5f8 push {r3, r4, r5, r6, r7, lr}
52: 4604 mov r4, r0
54: 460d mov r5, r1
56: 4617 mov r7, r2
58: 461e mov r6, r3
5a: 2005 movs r0, #5
5c: f7ff fffe bl 0 <more_fun>
60: 3c01 subs r4, #1
62: f165 0500 sbc.w r5, r5, #0
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
6e: 2000 movs r0, #0
70: 2100 movs r1, #0
72: bdf8 pop {r3, r4, r5, r6, r7, pc}
74: 2000 movs r0, #0
76: 2100 movs r1, #0
78: 4770 bx lr
7a: bf00 nop
And they chose to use an IT block here.
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
It is on par with this for number of instructions.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
But those Thumb-2 instructions are going to take more clocks to execute. So overall I think GCC appears to have made a better sequence, but of course you want to compare apples to apples: start with the same C code and see what each compiler produces. The GCC sequence also reads more easily than the eor/orr stuff; you can think less about what it is doing.
8000040: 406c eors r4, r5
00001002
8000042: ea94 0305 eors.w r3, r4, r5
00002001
0x1000 instructions; one is two halfwords (Thumb-2), one is one halfword (Thumb). It takes two clocks; I'm not really surprised.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
I count six clocks there before adding any other penalties, not four (on this Cortex-M4).
Note that I tried the eors.w both aligned and unaligned and it did not change the performance: still two clocks.
I'm new to ARM programming and I am trying to understand what the following code does:
.macro set_bit reg_addr bit
    ldr r4, =\reg_addr
    ldr r5, [r4]
    orr r5, #(1 << \bit)
    str r5, [r4]
.endm
In particular I am confused about the orr r5, #(1 << \bit) part. I understand that orr stands for logical OR, but I am not sure what that means in the given context. I think #(1 << \bit) is checking whether 1 is greater than the given bit, which would return true or false, but I am not sure what the orr instruction will do with it.
You are right that the ORR instruction performs a logical OR. In the context of this question it is used in the two-operand form:
ORR Rd, #constant    ; shorthand for ORR Rd, Rd, #constant
Now, the constant here is (1 << \bit), which means 1 left-shifted by \bit places; it is a shift, not a comparison. Here \bit is a number between 0 and 31 that decides which bit needs to be set. The assembler evaluates such constant expressions at assembly time, so the orr sets exactly that one bit in r5 and leaves the others unchanged.
Environment: GCC 4.7.3 (arm-none-eabi-gcc) for ARM Cortex m4f. Bare-metal (actually MQX RTOS, but here that's irrelevant). The CPU is in Thumb state.
Here's a disassembler listing of some code I'm looking at:
//.label flash_command
// ...
while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}
// Compiles to:
12: bf00 nop
14: f04f 0300 mov.w r3, #0
18: f2c4 0302 movt r3, #16386 ; 0x4002
1c: 781b ldrb r3, [r3, #0]
1e: b2db uxtb r3, r3
20: b2db uxtb r3, r3
22: b25b sxtb r3, r3
24: 2b00 cmp r3, #0
26: daf5 bge.n 14 <flash_command+0x14>
The constants (after expanding macros, etc.) are:
address of FTFE_FSTAT is 0x40020000u
FTFE_FSTAT_CCIF_MASK is 0x80u
This is compiled with NO optimization (-O0), so GCC shouldn't be doing anything fancy... and yet, I don't get this code. Post-answer edit: Never assume this. My problem was getting a false sense of security from turning off optimization.
I've read that "uxtb r3, r3" is a common way of truncating a 32-bit value. Why would you want to truncate it twice and then sign-extend? And how in the world is this equivalent to the bit-masking operation in the C code?
What am I missing here?
Edit: Types of the thing involved:
So the actual macro expansion of FTFE_FSTAT comes down to
((((FTFE_MemMapPtr)0x40020000u))->FSTAT)
where the struct is defined as
/** FTFE - Peripheral register structure */
typedef struct FTFE_MemMap {
    uint8_t FSTAT;  /**< Flash Status Register, offset: 0x0 */
    uint8_t FCNFG;  /**< Flash Configuration Register, offset: 0x1 */
    //... a bunch of other uint8_t fields
} volatile *FTFE_MemMapPtr;
The two uxtb instructions are the compiler being stupid, they should be optimized out if you turn on optimization. The sxtb is the compiler being brilliant, using a trick that you wouldn't expect in unoptimized code.
The first uxtb is due to the fact that you loaded a byte from memory. The compiler is zeroing the other 24 bits of register r3, so that the byte value fills the entire register.
The second uxtb is due to the fact that you're ANDing with an 8-bit value. The compiler realizes that the upper 24-bits of the result will always be zero, so it's using uxtb to clear the upper 24-bits.
Neither of the uxtb instructions does anything useful, because the sxtb instruction overwrites the upper 24 bits of r3 anyway. The optimizer should realize that and remove them when you compile with optimization enabled.
The sxtb instruction takes the one bit you care about 0x80 and moves it into the sign bit of register r3. That way, if bit 0x80 is set, then r3 becomes a negative number. So now the compiler can compare with 0 to determine whether the bit was set. If the bit was not set then the bge instruction branches back to the top of the while loop.
So my problem is one I thought was rather simple, and I have an algorithm, but I can't seem to make it work using Thumb-2 instructions.
Anyway, I need to reverse the bits of r0, and I thought the easiest way would be to logically shift the number right into a temporary register and then shift that left into the result register. However, LSL and LSR don't seem to let you capture the bit that is shifted out into the most or least significant bit of another register (while also shifting that register's bits). Is there some part of the instructions I am misunderstanding?
This is my ARM reference:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Cjacbgca.html
The bit being shifted out can be copied into the C bit (carry flag) if you use the S suffix ("set flags"), and the RRX instruction shifts C into bit 31 of the result. So you can probably do something like:
; 32 iterations
MOV R2, #32
; init result
MOV R1, #0
loop
; copy R0[31] into C and shift R0 to left
LSLS R0, R0, #1
; shift R1 to right and copy C into R1[31]
RRX R1, R1
; decrement loop counter
SUBS R2, #1
BNE loop
; copy result back to R0
MOV R0, R1
Note that this is a pretty slow way of reversing bits. If RBIT is available, you should use it; otherwise check some bit-twiddling tricks.
How about using the rbit instruction? My copy of the ARM ARM shows it having a Thumb-2 encoding in ARMv6T2 and above.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/Cihjgdid.html