How to compare two registers and perform an action if greater than, without branching, in ARM

I am trying to compare two registers, r5 and r7, which I know I can do with
CMP R7, R5
What I am trying to do is: if R7 > 1 then ADD R8, R8, #1, without branching, as I will be using this multiple times in different sections of the code.
I know BGT will branch if greater than; alternatively, is it possible to return to the previous position after branching, to add to the count?

Many ARM instructions are defined as OP{cond}; this means you can make the execution of the instruction depend on a condition:
cmp r5, r7
addgt r8, r8, #1 // increments r8 if r5 is greater than r7
mov r1, r0 // executes in any case
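Applied to the exact case in the question (comparing R7 against the constant 1; register names taken from the question), that would be:
cmp r7, #1
addgt r8, r8, #1 // increments r8 only if r7 > 1 (signed)
Note that this works as-is in ARM (A32) state; in Thumb-2 (e.g. on Cortex-M3/M4) the conditional add must be preceded by an it gt instruction.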

Related

KEIL: Why does a MUL within a loop only work in the 1st iteration?

I have written code in Keil in which there is a loop containing a multiplication, which can be done either by a left shift by 1 or by the MUL instruction. However, both take effect only in the 1st iteration; in the following ones the instruction executes, but nothing changes.
Does anyone know the probable reason?
Thanks
Here is the code of the loop:
loop ADD r9, r9, #1
;MUL r8, r7, r11 ;double the X
LSL r12, r7, #1
CMP r12, r4 ;CMP the X with P
BHI G1
BLS G0
It was solved. It occurred because this is a consecutive shift, and the value should be shifted by 1 at each iteration. The written code copies the source into the destination and then shifts by 1, so the source is not modified after the 1st iteration. The solution is to modify r7 itself with LSL r7, r7, #1, which shifts r7 by 1 position on every iteration.
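With that fix applied, the loop would look something like this (a sketch, keeping the original labels G0/G1 and registers):
loop ADD r9, r9, #1
LSL r7, r7, #1 ;double the X in place so the change persists
CMP r7, r4 ;CMP the X with P
BHI G1
BLS G0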

Keil ARMCC int64 comparison for Cortex M3

I noticed that armcc generates this kind of code to compare two int64 values:
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
Which can be roughly translated as:
r0 = lower_word_1 ^ lower_word_2
r1 = higher_word_1 ^ higher_word_2
r0 = r1 | r0
jump if r0 is not zero
and something like this when comparing an int64 (in r0,r1) with an integral constant (i.e. an int, in r3):
0x08000674 4058 EORS r0,r0,r3
0x08000676 4308 ORRS r0,r0,r1
0x08000678 D116 BNE 0x080006A8
with the same idea, just skipping the higher-word comparison altogether since the higher word only needs to be zero.
But I'm interested: why is it so complicated?
Both cases can be done very straightforwardly by comparing lower and higher words and placing a BNE after each:
for two int64, assuming the same registers
CMP lower words
BNE
CMP higher words
BNE
and for int64 with integral constant:
CMP lower words
BNE
CBNZ if higher word is non-zero
This will take the same number of instructions, each may (or may not, depending on the registers used) be 2 bytes in length.
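Spelled out with concrete registers (matching the armcc dumps above; the not_equal label is just illustrative), the two proposed sequences would be something like:
; two int64 values: low words in r4/r6, high words in r5/r7
CMP r4, r6 ; compare lower words
BNE not_equal
CMP r5, r7 ; compare higher words
BNE not_equal
; int64 in r0 (low), r1 (high), int constant in r3
CMP r0, r3 ; compare lower words
BNE not_equal
CBNZ r1, not_equal ; higher word must be zero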
arm-none-eabi-gcc does something different, but with no playing around with EORS either.
So why does armcc do this? I can't see any real benefit; both versions require the same number of instructions (each of which may be wide or short, so no real profit there).
The only slight benefit I can see is less branching, which may be somewhat beneficial for a flash prefetch buffer. But since there is no cache or branch prediction, I'm not really buying it.
So my reasoning is that this pattern is simply legacy from the ARM7 architecture, where CBZ/CBNZ did not exist and mixing ARM and Thumb instructions was not very easy.
Am I missing something?
P.S. armcc does this at every optimization level, so I presume it is some kind of 'hard-coded' pattern.
UPD: Sure, there is an execution pipeline that will be flushed with every branch taken; however, every solution requires at least one conditional branch that will or will not be taken (depending on the integers being compared), so the pipeline will be flushed anyway with equal probability.
So I can't really see the point in minimizing conditional branches.
Moreover, if the lower and higher words were compared explicitly and the integers are not equal, the branch would be taken sooner.
Avoiding a branch instruction completely is possible with an IT block, but on Cortex-M3 it can only be up to 4 instructions long, so I'm going to ignore this for generality.
The efficiency of the generated code is not measured by the number of machine code instructions. You need to know the internals of the target machine as well: not only the clocks per instruction, but also how the fetch/decode/execute process works.
Every taken branch on Cortex-M3 devices flushes the pipeline, and the pipeline has to be fed again. If you run from flash memory (which is slow), wait states will also significantly slow this process. The compiler tries to avoid branches as much as possible.
It can be done your way using other instructions:
int foo(int64_t x, int64_t y)
{
return x == y;
}
cmp r1, r3
itte eq
cmpeq r0, r2
moveq r0, #1
movne r0, #0
bx lr
Trust your compiler. The people who write them know their trade :). Before you learn more about the ARM Cortex, you can't judge the compiler as simply as you do now.
The code from your example is very well optimized and simple. Keil does a very good job.
As pointed out, the difference is branching vs. not branching. If you can avoid branching, you want to avoid branching.
While the ARM documentation may be interesting, as with an x86, a full-sized ARM, and many other platforms, the system plays as much of a role here. High-performance cores like the ones from ARM are sensitive to the system implementation. These cortex-m cores are used in microcontrollers, which are quite cost sensitive; so while they blow away a PIC, AVR, or msp430 for mips to mhz and mips per dollar, they are still cost sensitive. With newer technology, or perhaps higher cost, you are starting to see flashes that run at the speed of the processor across the full range (no wait states needed at various places across the range of valid clock speeds), but for a long time you saw the flash at half the speed of the core at the slowest core speeds, getting worse as you chose higher core speeds, while sram often matched the core. Either way, flash is a major portion of the cost of the part, and how much of it there is and how fast it is to some extent drives part price.
Depending on the core (anything from ARM), the fetch size, and as a result alignment, varies, and as a result benchmarks can be skewed/manipulated based on the alignment of a loop-style test and how many fetches are needed (trivial to demonstrate with many cortex-ms). The cortex-ms generally fetch either a halfword or a full word at a time, and for some cores this is a compile-time option for the chip vendor (so you might have two chips with the same core but different performance). This can be demonstrated too, but we can manage alignment here in this test.
I do not have a cortex-m3 handy (I would have to dig one out and wire it up if need be, but I should not need to); I do have a cortex-m4 handy, which is also armv7-m: a NUCLEO-F411RE.
Test fixture
.thumb_func
.globl HOP
HOP:
bx r2 @ jump to the code copied to ram (address in r2)
.balign 0x20 @ keep the loop alignment consistent between builds
.thumb_func
.globl TEST0
TEST0:
push {r4,r5}
mov r4,#0
mov r5,#0
ldr r2,[r0] @ read the systick counter at start (r0 = STK_CVR address)
t0:
cmp r4,r5 @ r4 == r5 always; the branch under test follows
beq skip @ taken, not taken, or nop depending on the variant
skip:
subs r1,r1,#1 @ r1 = iteration count
bne t0
ldr r3,[r0] @ read the systick counter at end
subs r0,r2,r3 @ elapsed ticks (systick counts down, so start - end)
pop {r4,r5}
bx lr
The systick timer generally works just fine for these kinds of tests; no need to mess with the debugger's timer, which often just shows the same thing with more work. It is more than enough here.
Called like this, with the result printed out in hex:
hexstring(TEST0(STK_CVR,0x10000));
hexstring(TEST0(STK_CVR,0x10000));
and like this to copy the flash code to ram and execute there:
hexstring(HOP(STK_CVR,0x10000,0x20000001));
hexstring(HOP(STK_CVR,0x10000,0x20000001));
Now the stm32s have a cache-like thing in front of the flash which affects loop-based benchmarks like these, as well as other benchmarks against these parts; sometimes you cannot get past it and you end up with a bogus benchmark. But not in this case.
To demonstrate fetch effects you want a system with some delay in fetching; if the fetches are too fast you might not see the fetch effects.
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d1ff bne.n 8000030 <skip>
08000030 <skip>:
00050001 <-- flash time
00050001 <-- flash time
00060004 <-- sram time
00060004 <-- sram time
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d0ff beq.n 8000030 <skip>
08000030 <skip>:
00060001
00060001
00080000
00080000
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: bf00 nop
08000030 <skip>:
00050001
00050001
00060000
00060000
So we can see that, as far as this loop-based test goes, a branch that is not taken costs the same as a nop. So perhaps there is a branch predictor (often a small cache that remembers the last N branches and their destinations and can start the prefetch a clock or two early); I did not dig into it, and did not really need to, as we can already see that there is a performance cost for a branch that has to be taken. That makes your suggested code not equal despite the same number of instructions: the same number of instructions, but not equal performance.
So the quickest way to remove the loop and avoid the stm32 cache thing is to do something like this in ram:
push {r4,r5}
mov r4,#0
mov r5,#0
cmp r4,r5
ldr r2,[r0]
instruction under test repeated many times
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
with the instruction under test being a bne to the next instruction, a beq to the next instruction, or a nop:
// 800002e: d1ff bne.n 8000030 <skip>
00002001
// 800002e: d0ff beq.n 8000030 <skip>
00004000
// 800002e: bf00 nop
00001001
I did not have room for 0x10000 instructions so I used 0x1000, and we can see that there is a hit for both branch types, with the one that does branch being more costly.
Note that the loop-based benchmark did not show this difference; you have to be careful doing benchmarks and judging results, even the ones I have shown here.
I could spend more time tweaking core settings or system settings, but based on experience I think this has already demonstrated the desire not to have cmp, bne, cbnz replace eor, orr, bne. Now to be fair, your other sequence uses an eor.w (a thumb2 extension), which burns more clocks than a plain thumb instruction, so there is another thing to consider (I measured that as well).
Remember, for these high-performance cores you need to be very sensitive to fetching and fetch alignment; it is very easy to make a bad benchmark. Not that an x86 is not high performance, but to make that inefficient core run smoother there is a ton of stuff around it to try to keep the core fed. It is similar to running a semi-truck vs a sports car: the truck can be efficient once up to speed on the highway, but in city driving, not so much; even keeping to the speed limit, a Yugo will get across town faster than the semi truck (if it does not break down). Fetch effects, unaligned transfers, etc. are difficult to see on an x86 but fairly easy on an ARM, so to get the best performance you want to avoid the easy cycle eaters.
Edit
Note that I jumped to conclusions too early about what GCC produces; I had to work more on crafting an equivalent comparison. I started with:
unsigned long long fun2 ( unsigned long long a)
{
if(a==0) return(1);
return(0);
}
unsigned long long fun3 ( unsigned long long a)
{
if(a!=0) return(1);
return(0);
}
00000028 <fun2>:
28: 460b mov r3, r1
2a: 2100 movs r1, #0
2c: 4303 orrs r3, r0
2e: bf0c ite eq
30: 2001 moveq r0, #1
32: 4608 movne r0, r1
34: 4770 bx lr
36: bf00 nop
00000038 <fun3>:
38: 460b mov r3, r1
3a: 2100 movs r1, #0
3c: 4303 orrs r3, r0
3e: bf14 ite ne
40: 2001 movne r0, #1
42: 4608 moveq r0, r1
44: 4770 bx lr
46: bf00 nop
These use an it instruction, which is a natural solution here since the if-then-else cases can each be a single instruction. It is interesting that they chose to use r1 instead of the immediate #0; I wonder if that is a generic optimization, due to the complexity of immediates on a fixed-length instruction set, or because immediates take less space on some architectures. Who knows.
800002e: bf0c ite eq
8000030: bf00 nopeq
8000032: bf00 nopne
00003002
00003002
800002e: bf14 ite ne
8000030: bf00 nopne
8000032: bf00 nopeq
00003002
00003002
Using sram, 0x1000 sets of those three instructions laid out linearly, so 0x3002 means 1 clock per instruction on average.
Putting a mov in the it block doesn't change the performance:
ite eq
moveq r0, #1
movne r0, r1
It is still one clock per instruction.
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a)
{
for(;a!=0;a--)
{
more_fun(5);
}
return(0);
}
48: b538 push {r3, r4, r5, lr}
4a: ea50 0301 orrs.w r3, r0, r1
4e: d00a beq.n 66 <fun4+0x1e>
50: 4604 mov r4, r0
52: 460d mov r5, r1
54: 2005 movs r0, #5
56: f7ff fffe bl 0 <more_fun>
5a: 3c01 subs r4, #1
5c: f165 0500 sbc.w r5, r5, #0
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
66: 2000 movs r0, #0
68: 2100 movs r1, #0
6a: bd38 pop {r3, r4, r5, pc}
This is basically the compare against zero:
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
And here is the compare against another variable:
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a, unsigned long long b)
{
for(;a!=b;a--)
{
more_fun(5);
}
return(0);
}
00000048 <fun4>:
48: 4299 cmp r1, r3
4a: bf08 it eq
4c: 4290 cmpeq r0, r2
4e: d011 beq.n 74 <fun4+0x2c>
50: b5f8 push {r3, r4, r5, r6, r7, lr}
52: 4604 mov r4, r0
54: 460d mov r5, r1
56: 4617 mov r7, r2
58: 461e mov r6, r3
5a: 2005 movs r0, #5
5c: f7ff fffe bl 0 <more_fun>
60: 3c01 subs r4, #1
62: f165 0500 sbc.w r5, r5, #0
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
6e: 2000 movs r0, #0
70: 2100 movs r1, #0
72: bdf8 pop {r3, r4, r5, r6, r7, pc}
74: 2000 movs r0, #0
76: 2100 movs r1, #0
78: 4770 bx lr
7a: bf00 nop
And they chose to use an it block here:
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
It is on par with this for the number of instructions:
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
But those thumb2 instructions are going to take longer to execute, so overall I think GCC appears to have made the better sequence. Of course, to compare apples to apples you want to start with the same C code and see what each compiler produced. The gcc one also reads more easily than the eor/orr stuff; you have to think less about what it is doing.
8000040: 406c eors r4, r5
00001002
8000042: ea94 0305 eors.w r3, r4, r5
00002001
0x1000 instructions each: one is two halfwords (thumb2), one is one halfword (thumb). The wide one takes two clocks; not really surprised.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
I see six clocks there before adding any other penalties, not four (on this cortex-m4).
Note that I made the eors.w both aligned and unaligned and it did not change the performance; still two clocks.

How can I do this section of code, but using auto-indexing with ARM Assembly

This works, but I have to do it using auto-indexing and I cannot figure out that part.
writeloop:
cmp r0, #10
beq writedone
ldr r1, =array1
lsl r2, r0, #2
add r2, r1, r2
str r2, [r2]
add r0, r0, #1
b writeloop
and for data I have
.balign 4
array1: skip 40
What I had tried was this, and yes, I know it is probably a poor attempt, but I am new to this and do not understand it:
ldr r1, =array1
writeloop:
cmp r0, #10
beq writedone
ldr r2, [r1], #4
str r2, [r2]
add r0, r0, #1
b writeloop
It says segmentation fault when I try this. What is wrong? What I think should happen is that every time it loops through, it sets the element to the address of itself, and then increments to the next element and does the same thing.
The ARM architecture provides several different addressing modes.
From ARM946E-S product overview and many other sources:
Load and store instructions have three primary addressing modes
- offset
- pre-indexed
- post-indexed.
They are formed by adding or subtracting an immediate or register-based offset to or from a base register. Register-based offsets can also be scaled with shift operations. Pre-indexed and post-indexed addressing modes update the base register with the base plus offset calculation. As the PC is a general purpose register, a 32‑bit value can be loaded directly into the PC to perform a jump to any address in the 4GB memory space.
As well, they support write back, or updating of the base register; hence the pre-indexed and post-indexed forms. Post-indexing doesn't make much sense without write back.
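As a quick illustration of the three modes (the #4 offsets are arbitrary):
ldr r2, [r1, #4] ; offset: load from r1+4, r1 unchanged
ldr r2, [r1, #4]! ; pre-indexed: load from r1+4, then r1 = r1+4
ldr r2, [r1], #4 ; post-indexed: load from r1, then r1 = r1+4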
Now to your issue: I believe that you want to write the values 0-9 to an array of ten words (four bytes each). Assuming this, you can use indexing and update the index via add. This leads to,
mov r0, #0 ; start value
ldr r1, =array1 ; array pointer
writeloop:
cmp r0, #10
beq writedone
str r0, [r1, r0, lsl #2] ; index with r1 base by r0 scaled by *4
add r0, r0, #1
b writeloop
writedone:
; code to jump somewhere else and not execute data.
.balign 4
array1: skip 40
For interest, a more efficient loop can be done by counting down (note bpl rather than bne, so that element 0 is also written),
mov r0, #9 ; start value
ldr r1, =array1 ; array pointer
writeloop:
str r0, [r1, r0, lsl #2] ; index with r1 base by r0 scaled by *4
subs r0, r0, #1
bpl writeloop ; loop while r0 >= 0
Your original example was writing the pointer into the array, often referred to as 'value equals address'. If this is what you want,
ldr r0, =array_end ; finished?
ldr r1, =array1 ; array pointer
write_loop:
str r1, [r1], #4 ; add four and update after storing
cmp r0, r1
bne write_loop
; code to jump somewhere else and not execute data.
.balign 4
array1: skip 40
array_end:
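For reference, the value-equals-address loop above corresponds to roughly this C (illustrative only; the pointer-to-integer cast assumes a 32-bit target):
void fill_value_equals_address(void)
{
    extern unsigned int array1[10]; /* matches 'skip 40' = ten words */
    unsigned int *p = array1;
    unsigned int *end = array1 + 10; /* array_end */
    while (p != end)
    {
        *p = (unsigned int)p; /* element = its own address */
        p++;
    }
}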

How to implement 16bit stereo mixing on ARMv6+?

I need to optimize my mixing code in C for faster response time, so I decided to use inline assembly to mix two buffers into a new, bigger buffer. Basically, I have the left and right channels separated, and I want to put them together into one buffer. So I need to put 2 bytes from the left channel, then 2 bytes from the right channel, and so on.
For this I decided to pass my 3 pointers to my assembly code, where I intend to copy the memory pointed to by the left-channel pointer into the R0 register and the memory pointed to by the right-channel pointer into R1. Afterwards, I intend to mix R0 and R1 into R3 and R4, and later save those registers to memory. (I intend to use other free registers to do the same procedure and reduce processing time with pipelining.)
So I have two registers, R0 and R1, with data, and need to mix them into R3 and R4; I need to end up with R3 = R0HI (high part) + R1HI (high part) and R4 = R0LO (low part) + R1LO (low part).
I can think of using bitwise shifts, but my question is whether there is an easier way to do it, like on the Intel x86 architecture, where you can transfer the data into the AX register and then use AH as the high part and AL as the low part.
Is my thinking right? Is there a faster way to do this?
My actual (not working) code in the NDK:
void mux(short *pLeftBuf, short *pRightBuf, short *pOutBuf, int vecsamps_stereo) {
int iterations = vecsamps_stereo / 4;
asm volatile(
"ldr r0, %[outbuf];"
"ldr r1, %[leftbuf];"
"ldr r2, %[rightbuf];"
"ldr r3, %[iter];"
"ldr r4, [r3];"
"mov r8, r4;"
"mov r9, r0;"
"mov r4, #0;"
"mov r10, r4;"
"loop:; "
"ldr r2, [r1];"
"ldr r3, [r2];"
"ldr r7, =0xffff;"
"mov r4, r2;"
"and r4, r4, r7;"
"mov r5, r3;"
"and r5, r5, r7;"
"lsl r5, r5, #16;"
"orr r4, r4, r5;"
"lsl r7, r7, #16;"
"mov r5, r2;"
"and r5, r5, r7;"
"mov r6, r3;"
"and r6, r6, r7;"
"lsr r6, r6, #16;"
"orr r5, r5, r6;"
"mov r6, r9;"
"str r4, [r6];"
"add r6, r6, #1;"
"str r5, [r6];"
"add r6, r6, #1;"
"mov r9, r6;"
"mov r4, r10;"
"add r4, r4, #1;"
"mov r10, r4;"
"cmp r4, r8;"
"blt loop"
:[outbuf] "=m" (pOutBuf)
:[leftbuf] "m" (pLeftBuf) ,[rightbuf] "m" (pRightBuf),[iter] "m" (pIter)
:"r0","r1","r2","r3","memory"
);
}
I may not be 100% clear on what you are trying to do, but it looks like you want:
R3[31:16] = R0[31:16], R3[15:0] = R1[31:16];
R4[31:16] = R0[15:0], R4[15:0] = R1[15:0];
and not the actual sum.
In this case, you should be able to accomplish this relatively efficiently with a spare register holding a 16-bit mask. ARM assembly offers shifting of the second operand as part of most arithmetic and logical instructions.
MOVW R2, #0xffff ; 16-bit mask in lower half of R2 (LDR R2, =0xffff on cores without MOVW)
AND R3, R0, R2, LSL #16 ; R3 = R0 & 0xffff0000, or R3[31:16] = R0[31:16]
ORR R3, R3, R1, LSR #16 ; R3 = R3 | (R1 >> 16), or R3[15:0] = R1[31:16]
AND R4, R2, R1 ; R4 = R2 & R1, or R4[15:0] = R1[15:0]
ORR R4, R4, R0, LSL #16 ; R4 = R4 | (R0 << 16), or R4[31:16] = R0[15:0]
; repeat to taste
Another option is to load just 16 bits at a time, but this may perform worse if your buffers are in slow memory, and it may not work at all if the memory doesn't support accesses narrower than 32 bits. I'm not certain whether the core will request the full 32 bits and mask out what isn't needed, or whether it relies on the memory system to handle the byte lanes.
; assume R2 contains CH1 pointer, R3 contains CH2 pointer,
; and R1 contains output ptr
LDRH R0, [R2] ; load first 16 bits pointed to by CH1 into R0
STRH R0, [R1] ; store those 16 bits back into *output
LDRH R0, [R3] ; load first 16 bits pointed to by CH2 into R0
STRH R0, [R1, #2]! ; store those 16 bits back into *(output+2),
; write output+2 to R1
; after "priming" we can now run the following for
; auto increment of pointers.
LDRH R0, [R2, #2]! ; R0 = *(CH1+2), CH1 += 2
STRH R0, [R1, #2]! ; *(Out+2) = R0, Out += 2
LDRH R0, [R3, #2]! ; R0 = *(CH2+2), CH2 += 2
STRH R0, [R1, #2]! ; *(Out+2) = R0, Out += 2
; Lather, rinse, repeat.
These two examples make use of some of the handy features of ARM assembly. The first example makes use of the built in shift available on most instructions, while the second makes use of sized load/store instructions as well as the write-back on these instructions. These should both be compatible with Cortex-M cores. If you do have a more advanced ARM, #Notlikethat's answer is more suitable.
In terms of code size, when you add the load and store to the first example, you end up executing two load instructions, the four logic instructions, and two stores, for a total of eight instructions for mixing two samples. The second example uses two loads and two stores, for a total of four instructions when mixing one sample or, well, eight instructions for mixing two.
You will probably find the first example works faster, as it has fewer memory accesses, and the number of stores can be reduced by using an STM store-multiple instruction (i.e. STMIA OutputRegister!, {R3, R4}). In fact, the first example can be pipelined a bit by using eight registers: LDMIA can load four 16-bit samples from a channel in one instruction; you then perform two sets of the four mixing instructions and store the four output words with one STMIA. This likely wouldn't offer much benefit in performance, since it will likely interact with the memory in the same manner (STM and LDM essentially execute multiple LDRs and STRs), but if you are optimizing for minimal instructions this would result in 11 instructions to mix four samples (compared to 16), as sketched below.
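A sketch of that LDM/STM variant, with an assumed register assignment (r1 = left pointer, r2 = right pointer, r0 = output pointer, r12 holding the 0xffff mask; samples are little-endian halfwords):
LDMIA r1!, {r4, r5} ; r4 = L1:L0, r5 = L3:L2 (four left samples)
LDMIA r2!, {r6, r7} ; r6 = R1:R0, r7 = R3:R2 (four right samples)
AND r8, r12, r4 ; r8[15:0] = L0
ORR r8, r8, r6, LSL #16 ; r8 = R0:L0, first output word
BIC r9, r6, r12 ; r9[31:16] = R1
ORR r9, r9, r4, LSR #16 ; r9 = R1:L1, second output word
AND r10, r12, r5 ; r10[15:0] = L2
ORR r10, r10, r7, LSL #16 ; r10 = R2:L2
BIC r11, r7, r12 ; r11[31:16] = R3
ORR r11, r11, r5, LSR #16 ; r11 = R3:L3
STMIA r0!, {r8, r9, r10, r11} ; eleven instructions for four samples per channel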
ARM registers are strictly 32-bit; however, provided you're on a recent enough core (v6+, but not Thumb-only v7-M) there are a number of suitable instructions for dealing with halfwords (PKHBT, PKHTB) or arbitrary slices of registers (BFI, UBFX), not to mention the crazy parallel add/subtract instructions that frighten me (available with saturating arithmetic, which can be useful for audio, too).
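For instance, the halfword packing described in the question could be done in two instructions (a sketch, assuming the same layout as the other answer: inputs in R0/R1, outputs in R3/R4):
PKHTB R3, R0, R1, ASR #16 ; R3[31:16] = R0[31:16], R3[15:0] = R1[31:16]
PKHBT R4, R1, R0, LSL #16 ; R4[15:0] = R1[15:0], R4[31:16] = R0[15:0]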
However, if your machine implements NEON instructions, they would be the route to the optimal implementation, since this is exactly the sort of thing they are designed for. Plus, they should be accessible through compiler intrinsics, so you can use them directly in C code.
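As a rough illustration of the intrinsics route (a sketch only; the function name and loop structure are mine, not from the post), vst2q_s16 performs exactly this kind of two-channel interleaving store:
#include <arm_neon.h>
#include <stdint.h>

void mux_neon(const int16_t *left, const int16_t *right,
              int16_t *out, int samples_per_channel)
{
    int i = 0;
    for (; i + 8 <= samples_per_channel; i += 8) {
        int16x8x2_t frame;
        frame.val[0] = vld1q_s16(left + i);  /* 8 left samples */
        frame.val[1] = vld1q_s16(right + i); /* 8 right samples */
        vst2q_s16(out + 2 * i, frame);       /* store interleaved L/R */
    }
    for (; i < samples_per_channel; i++) {   /* scalar tail */
        out[2 * i] = left[i];
        out[2 * i + 1] = right[i];
    }
}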

ARM Loop stops executing instructions

I'm working on a program that counts the number of 1's and 0's in a binary string.
My countOnes subroutine works as expected:
countOnes MOV r1, r0 ;mov random number into r1
TEQ r1, #0 ;test if all 0
MOVEQ pc, r14 ;if all 0's, break
onesloop SUB r2, r1, #1 ;subtract 1
AND r1, r2, r1 ;and on r1 and r2
ADD r9, r9, #1 ;increment loop counter
TEQ r1, #0 ;test if all 0
BNE onesloop ;if not all 0's, loop
MOV pc, r14 ;return
countZeros MOV r1, r0 ;mov random number in r1
MVN r8, #0 ;fill r8 with 1's
TEQ r1, r8 ;test if number is all 1's
MOVEQ pc, r14 ;if all 1's, break
zerosloop ADD r2, r1, #1 ;r2 = r1 + 1
AND r1, r2, r1 ;and r1 and r2
ADD r9, r9, #1 ;increment loop count
TEQ r1, r8 ;test if all ones
BNE zerosloop ;if not all 1's, loop
MOV pc, r14 ;return
However, my countZeros subroutine loops indefinitely, and after looking at it with a debugger, it turns out that the ADD and AND instructions only take effect the first time through the loop (which is the reason for the infinite loop); the loop itself is not broken, though, as register 9 continues to be incremented every iteration.
I can't think of any reason for those instructions to stop being executed. Has anyone encountered this behavior before, or does anyone know what would cause it? If you need more info on anything, just ask.
In your zerosloop you need ORR r1, r2, r1 instead of AND.
Also you should use CMN r1, #1 to find out whether r1 is all 1s.
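Putting both suggestions together, the subroutine might look like this (a sketch in the same style; the r8 mask is no longer needed):
countZeros MOV r1, r0 ;mov random number into r1
CMN r1, #1 ;r1 + 1 == 0 only if r1 is all 1's
MOVEQ pc, r14 ;if all 1's, return
zerosloop ADD r2, r1, #1 ;r2 = r1 + 1 (lowest clear bit becomes set)
ORR r1, r2, r1 ;set the lowest 0 bit of r1
ADD r9, r9, #1 ;increment zero count
CMN r1, #1 ;test if all 1's yet
BNE zerosloop ;if not all 1's, loop
MOV pc, r14 ;return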
This article on ARM population counts might be worth a look:
http://www.sciencezero.org/index.php?title=ARM:_Count_ones_(bit_count)
