Unable to implement C Logic in ARM assembly - c

I have some C code in mind that I want to implement in ARM assembly.
The C code is something like this:
int a;
scanf("%d",&a);
if(a == 0 || a == 1){
a = 1;
}
else{
a = 2;
}
What I have tried:
//arm equivalent of taking input to reg r0
//check for first condition
cmp r0,#1
moveq r0,#1
//if false
movne r0,#2
//check for second condition
cmp r0,#0
moveq r0,#1
Is this the correct way of implementing it?

Your code is broken for a=0 - single step through it in your head, or in a debugger, to see what happens.
Given this specific condition, it's equivalent to (unsigned)a <= 1U (because negative integers convert to huge unsigned values). You can do a single cmp and movls / movhi. Compilers already spot this optimization; here's how to ask a compiler to make asm for you so you can learn the tricks clever humans programmed into them:
int foo(int a) {
if(a == 0 || a == 1){
a = 1;
}
else{
a = 2;
}
return a;
}
With ARM GCC10 -O3 -marm on the Godbolt compiler explorer:
foo:
cmp r0, #1
movls r0, #1
movhi r0, #2
bx lr
See How to remove "noise" from GCC/clang assembly output? for more about making functions that will have useful asm output. In this case, r0 is the first arg-passing register in the calling convention, and also the return-value register.
I also included another C version using if (a <= 1U) to show that it compiles to the same asm. (1U is an unsigned constant, so C's usual arithmetic conversions implicitly convert a to unsigned so the types match for the <= operator. You don't need to explicitly do (unsigned)a <= 1U.)
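To see the equivalence without reading asm, here is a minimal C sketch. The function names are made up for illustration; only the two conditions come from the answer.

```c
/* Straightforward form: two equality tests joined with ||. */
int classify_or(int a)
{
    return (a == 0 || a == 1) ? 1 : 2;
}

/* Range-check form: a is converted to unsigned before the compare,
 * so every negative value becomes a huge unsigned value and fails it.
 * Writing a <= 1U without the cast converts a the same way. */
int classify_range(int a)
{
    return ((unsigned)a <= 1U) ? 1 : 2;
}
```

Both functions should agree for every int, including negative inputs such as -1.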
General case: not a single range
For a case like a==0 || a==3 that isn't a single range-check, you can predicate a 2nd cmp. (Godbolt)
foo:
cmp r0, #3 @ sets Z if a was 3
cmpne r0, #0 @ leaves Z unmodified if it was already set, else sets it according to a == 0
moveq r0, #1
movne r0, #2
bx lr
You can similarly chain && like a==3 && b==4, or for checks like a >= 3 && a <= 7 you can sub / cmp, using the same unsigned-compare trick as the 0 or 1 range check after sub maps a values into the 0..n range. See the Godbolt link for that.
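As a rough C illustration of the sub / cmp idea for a >= 3 && a <= 7 (function names are hypothetical): subtracting the lower bound in unsigned arithmetic maps 3..7 onto 0..4, so one unsigned compare is enough.

```c
/* Two signed compares joined with &&. */
int in_range_plain(int a)
{
    return a >= 3 && a <= 7;
}

/* Subtract the lower bound in unsigned arithmetic, then do one unsigned
 * compare. Values below 3 wrap around to huge unsigned numbers and fail. */
int in_range_sub(int a)
{
    return (unsigned)a - 3U <= 4U;
}
```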

No, that does not work.
cmp r0,#1 is it a one?
moveq r0,#1 yes, make it a one again?
movne r0,#2 otherwise make it a 2; but what if it was a zero to start? now it is a 2
cmp r0,#0 at this point it is either a 1 or a 2, you forced it, so it cannot be zero; what it started off as is now lost.
moveq r0,#1
You have the right concept but need to order things better.
Following that line of thinking, though,
maybe use another register:
x = 2;
if(a==0) x = 1;
if(a==1) x = 1;
a = x;
Ponder this
if(a==0) a = 1;
if(a!=1) a = 2;
Or, as everyone else is going to say, ask the compiler.
Because of the or (test OR test), in general the tests need to be done separately. A false result from the first test does not mean you take the else branch; you have to do the other test before declaring false. But if the first test is true, you need to hop over everything and not fall into the second test, because that might (in this case will) be false...
As Peter points out, you can use the unsigned lower-or-same and higher conditions (even though in C it is a signed int, bits is bits).
LS Unsigned lower or same
HI Unsigned higher

Depending on the ARM instruction set, it can be:
cmp r0, #1
movls r0, #1
movhi r0, #2
bx lr
or
cmp r0, #1
ite ls
movls r0, #1
movhi r0, #2
bx lr
Am I smarter than you? NO I simply use the compiler to compile the C code.
https://godbolt.org/z/dqxv64Eb9

Related

How can I use arm adcs in loop?

In ARM assembly language, the ADCS instruction adds with the carry flag C and then sets the condition flags.
The CMP instruction also sets the condition flags, so the carry from ADCS gets overwritten.
How can I solve this?
This is my code, it is doing BCD adder with r0 and r1 :
ldr r8, =#0
ldr r9, =#15
adds r7, r8, #0
ADDLOOP:
and r4, r0, r9
and r5, r1, r9
adcs r6, r4, r5
orr r7, r6, r7
add r8, r8, #1
mov r9, r9, lsl #4
cmp r8, #3
bgt ADDEND
bl ADDLOOP
ADDEND:
mov r0, r7
I tried to save the state of condition flags, but I don't know how to do.
To save/restore the Carry flag, you could create a 0/1 integer in a register (perhaps with adc reg, zeroed_reg, #0?), then next iteration use cmp reg, #1 (C is set when reg >= 1 as unsigned) or lsrs reg, reg, #1 (which shifts bit 0 out into C) to set the carry flag from it.
ARM can't materialize C as an integer 0/1 with a single instruction without any setup; compilers normally use movcs r0, #1 / movcc r0, #0 when not in a loop (Godbolt), but in a loop you'd probably want to zero a register once outside the loop instead of using two instructions predicated on carry-set / carry-clear.
Loop without modifying C
Use teq r8, #4 / bne ADDLOOP as the loop branch, like the bottom of a do{}while(r8 != 4).
Or count down from 4 with tst r8,r8 / bne ADDLOOP, using sub r8, #1 instead of add.
TEQ updates N and Z but not the C or V flags (unless you use a shifted source operand, in which case it can update C). Unlike cmp, which sets flags like subs, it sets flags like eors. The eq / ne conditions work the same: subtraction and XOR both produce zero when the inputs are equal, and non-zero in every other case. Greater / less conditions wouldn't be meaningful after teq anyway.
This is what optimized BigInt code like GMP does, for example in its mpn_add_n function (source) which adds two bigint inputs (arrays of 32-bit chunks).
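For readers who think in C, here is a hedged sketch of what such a multi-limb add computes. add_limbs is a hypothetical helper written for this illustration, not GMP's actual code or signature. In C the carry has to live in a variable; in asm the C flag itself holds it between adcs instructions, which is why the loop-control instructions must not clobber it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r = a + b over n 32-bit limbs, least significant
 * limb first. Returns the final carry-out (0 or 1). */
uint32_t add_limbs(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;                  /* plays the role of the C flag */
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;   /* like adcs */
        r[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);   /* must survive to the next limb */
    }
    return carry;
}
```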
IDK why you were jumping forwards over a bl (branch-and-link) which sets lr as a return address. Don't do that, structure your asm loops like a do{}while() because it's more efficient, especially when the trip-count is known to be non-zero so you don't have to worry about running the loop zero times in some cases.
There are cbz/cbnz instructions (docs) that jump on a register being zero or non-zero without affecting flags, but they can only jump forwards (out of the loop, past an unconditional branch). They're also only available in Thumb mode, unlike teq which was probably specifically designed to give ARM an efficient way to write BigInt loops.
BCD adding
Your algorithm has bugs; you need base-10 carry, like 0x05 + 0x06 = 0x11 not 0x0b in packed BCD.
And even the binary Carry flag isn't set by something like 0x0005000 + 0x0007000; there's no carry-out from the high bit, only into the next nibble. Also, adc adds the carry-in at the bottom of the register, not at the nibble your mask isolated.
So maybe you need to do something like subtract 0x000a000 from the sum (for that example shift position), because that will carry-out. (ARM sets C as a !borrow on subtraction, so maybe rsb reverse-subtract or swap the operands.)
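A minimal C sketch of digit-by-digit packed-BCD addition with a base-10 carry, to make the target behavior concrete. The name bcd_add and the choice of 8 digits are assumptions for illustration; it assumes every input nibble is a valid digit 0..9 and drops any carry out of the top digit.

```c
#include <stdint.h>

/* Add two packed-BCD values, 8 digits each, one nibble at a time. */
uint32_t bcd_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    uint32_t carry = 0;                   /* decimal carry between digits */
    for (int shift = 0; shift < 32; shift += 4) {
        uint32_t digit = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry;
        carry = digit > 9;                /* base-10 carry, not the binary one */
        if (carry)
            digit -= 10;                  /* bring the digit back into 0..9 */
        result |= digit << shift;
    }
    return result;
}
```

With this, bcd_add(0x05, 0x06) gives 0x11, and a carry ripples through every digit for 0x999999 + 0x1.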
NEON should make it possible to unpack to 8-bit elements (mask odd/even and interleave) and do all nibbles in parallel, but carry propagation is a problem; ARM doesn't have an efficient way to branch on SIMD vector conditions (unlike x86 pmovmskb). Just byte-shifting the vector and adding could generate further carries, as with 999999 + 1.
IDK if this can be cut down effectively with the same techniques hardware uses, like carry-select or carry-lookahead, but for 4-bit BCD digits with SIMD elements instead of single bits with hardware full-adders.
It's not worth doing for binary bigint because you can work in 32 or 64-bit chunks with the carry flag to help, but maybe there's something to gain when primitive hardware operations only do 4 bits at a time.

Finding the smallest element in array in arm assembly

I have this program that i have to write in arm assembly to find the smallest element in an array. Normally this is a pretty easy thing to do in every programming language, but i just can't get my head around what i'm doing wrong in arm assembly. I'm a beginner in arm but i know my way around c. So I wrote the algorithm on how to find the smallest number in an array in c like this.
int minarray = arr[0];
for (int i =0; i < len; i++){
if (arr[i] < minarray){
minarray = arr[i];
}
}
It's easy and nothing special really.
Now i tried taking over the algorithm in arm almost the same. There are two things that have already been programmed from the beginning. The address of the first element is stored in register r0. The length of the array is stored in register r1. In the end, the smallest element must be stored back in register r0. Here is what i did:
This is almost the same algorithm as the one in c. First i load the first element into a new register r4. Now the first element is the smallest. Then once again, i load the first element in r8. I compare those two, if r8 <= r4, then copy the content of r8 to r4. After that (because i'm working with numbers of 32 bits) i add 4bytes to r0 to get on to the next element of the array. After that i subtract 1 from the array length to loop through the array until its below 0 to stop the program.
The feedback i'm getting from my testing function that was given to us to check if our program works says that it works partly. It says that it works for short arrays and arrays of length 0 but not for long arrays. I'm honestly lost. I think i'm making a really dumb mistake but i just cannot find it and i've been stuck at this easy problem for 3 days now but everything i have tried did not work or as i said, only worked "partly". I would really appreciate if someone could help me out.
This is the feedback that i get:
✗ min works with other numbers
✗ min works with a long array
✓ min works with a short array
✓ min tolerates size = 0
(x is for "it does not work", ✓ is for "it works")
So you see what i'm saying? i just do not understand how to implement the fact that its supposed to work with a longer array.
I'm not very good at ARM assembly, but to my understanding R4 is expected to hold the minimum and R8 is used to hold the most recently fetched value from the input array.
The minimum is updated with this instruction:
MOVLE r8, r4
But it actually updates R8, not R4.
Try:
MOVLE r4, r8
EDIT
Another issue is the incorrect branch instruction:
SUBS r1, r1, #1
BPL loop1
works like:
r1 = r1 - 1
if (r1 >= 0) goto loop1;
For R1 equal to 1 the loop is executed twice.
r1 = 1
... do stuff
r1 = r1 - 1 // r1 is 0 now
if (r1 >= 0) goto loop1; // 0>=0 TRUE!
... do stuff, overflow the input by indexing at `[r0 + 4]`
r1 = r1 - 1 // r1 is -1
if (r1 >= 0) goto loop1; // -1 >= 0 FALSE
// exit function
To fix it, branch back only while the counter is non-zero.
BNE loop1
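The off-by-one can be checked with a small C model of the two loop endings. The function names are invented for this sketch, and it assumes the counter is at least 1 on entry (a length of 0 needs its own check before the loop).

```c
/* Model of:  loop1: ...body... ; SUBS r1, r1, #1 ; BPL loop1 */
int trips_with_bpl(int r1)
{
    int trips = 0;
    do {
        trips++;           /* loop body runs once */
        r1 = r1 - 1;       /* SUBS */
    } while (r1 >= 0);     /* BPL: branch while the result is not negative */
    return trips;
}

/* Same loop ending in BNE: branch while the result is non-zero. */
int trips_with_bne(int r1)
{
    int trips = 0;
    do {
        trips++;
        r1 = r1 - 1;
    } while (r1 != 0);
    return trips;
}
```

For a counter of 1, the BPL version runs the body twice and the BNE version once.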
When coding in C, use the correct types.
You also do not have to start iterating at index 0; index 1 is enough, since arr[0] is already the initial minimum.
int foo(const int *arr, size_t len)
{
int minarray = arr[0];
for (size_t i = 1; i < len; i++)
{
if (arr[i] < minarray)
{
minarray = arr[i];
}
}
return minarray;
}
And it generates this code:
foo:
mov r3, r0
subs r1, r1, #1
ldr r0, [r3], #4
beq .L1
.L3:
ldr r2, [r3], #4
cmp r0, r2
it ge
movge r0, r2
subs r1, r1, #1
bne .L3
.L1:
bx lr

When to use CMP & TEQ instructions in ARM Assembly?

why two separate instructions instead of one instruction? Practically in what kind of situations we need to use CMP and TEQ instructions.
I know how both the instruction works.
In short: they serve different purposes. cmp is subs without a destination, while teq is eors without a destination.
cmp is very straightforward: you compare two numbers A and B
signed:
gt: A > B
ge: A >= B
eq: A == B
le: A <= B
lt: A < B
unsigned:
hi: A > B
hs: A >= B
eq: A == B
ls: A <= B
lo: A < B
Let's assume the problem below though:
int32_t foo(int32_t A)
{
if (((A < 0) && ((A & 1) == 1)) || ((A >= 0) && ((A & 1) == 0)))
{
A += 1;
}
else
{
A -= 1;
}
return A;
}
In plain language, the if statement is true if A is either an odd negative number or an even non-negative number, and Linaro GCC 7.4.1 at -O3 will generate the mess below:
foo
0x00000000: CMP r0,#0
0x00000004: AND r3,r0,#1
0x00000008: BLT {pc}+0x14 ; 0x1c
0x0000000C: CMP r3,#0
0x00000010: BEQ {pc}+0x14 ; 0x24
0x00000014: SUB r0,r0,#1
0x00000018: BX lr
0x0000001C: CMP r3,#0
0x00000020: BEQ {pc}-0xc ; 0x14
0x00000024: ADD r0,r0,#1
0x00000028: BX lr
People knowledgeable in the field of bit hacking would alter the if statement like below:
int32_t bar(int32_t A)
{
if ((A ^ (A<<31)) >= 0)
{
A += 1;
}
else
{
A -= 1;
}
return A;
}
And the results are:
bar
0x0000002C: EORS r3,r0,r0,LSL #31
0x00000030: ADDPL r0,r0,#1
0x00000034: SUBMI r0,r0,#1
0x00000038: BX lr
And finally, assembly programmers will replace EORS with teq r0, r0, lsl #31.
It won't make the code any faster, but it doesn't need R3 as the scratch register.
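A hedged C check that the XOR form tests the same thing as the original condition. The function names are made up, and the shift is done on uint32_t here so it cannot overflow a signed type.

```c
#include <stdint.h>

/* The original condition: odd and negative, or even and non-negative. */
int cond_plain(int32_t A)
{
    return ((A < 0) && ((A & 1) == 1)) || ((A >= 0) && ((A & 1) == 0));
}

/* Bit-hack form: move bit 0 up to bit 31 and XOR it with the sign bit.
 * A clear top bit corresponds to the PL condition after EORS / TEQ. */
int cond_xor(int32_t A)
{
    uint32_t u = (uint32_t)A;
    return ((u ^ (u << 31)) >> 31) == 0;
}
```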
Note that the code above is just a show case, being a separate function where you have excess of available registers.
In real life however, registers are by far the most scarce resource, especially inside a loop, and even compilers will make use of the teq instruction in similar situations.
Summing it up, there are fields such as error correction, decryption/encryption, etc. where tons of xor operations are done, and people dealing with those problems appreciate instructions such as teq and know when to use them.
And always remember: never trust compilers

"Anomaly" in signed integer in C

I'm currently writing a lecture on ARM optimization, specifically on vector machines such as NEON as the final target.
And since vector machines don't fare well with if-else slaloms, I'm trying to demonstrate how to get rid of them by bit-hacking.
I picked the "saturating absolute" function as an example for this. It's practically an ABS routine with the added functionality of capping the result at 0x7fffffff.
The most negative 32-bit number is 0x80000000, and it's a very dangerous thing because val = -val; returns the same 0x80000000 as the initial value. This is caused by the asymmetry of the two's complement system and matters especially for DSP operations, so it has to be filtered out, mostly by "saturating".
int32_t satAbs1(int32_t val)
{
if (val < 0) val = -val;
if (val < 0) val = 0x7fffffff;
return val;
}
Below is what I would write in assembly:
cmp r0, #0
rsblts r0, r0, #0
mvnlt r0, #0x80000000
bx lr
And below is what I actually get for the C code above:
satAbs1
0x00000000: CMP r0,#0
0x00000004: RSBLT r0,r0,#0
0x00000008: BX lr
WTH? The compiler simply discarded the saturating part altogether!
The compiler seems to be ruling out val being negative after the first if statement which isn't true if it was 0x80000000
Or maybe the function should return an unsigned value?
uint32_t satAbs2(int32_t val)
{
uint32_t result;
if (val < 0) result = (uint32_t) -val; else result = (uint32_t) val;
if (result == 0x80000000) result = 0x7fffffff;
return result;
}
satAbs2
0x0000000C: CMP r0,#0
0x00000010: RSBLT r0,r0,#0
0x00000014: BX lr
Unfortunately, it generates the exact same machine codes as the signed version: no saturation.
Again, the compiler seems to rule out the case of val being 0x80000000
Ok, let's widen the range of the second if statement:
uint32_t satAbs3(int32_t val)
{
uint32_t result;
if (val < 0) result = (uint32_t) -val; else result = (uint32_t) val;
if (result >= 0x80000000) result = 0x7fffffff;
return result;
}
satAbs3
0x00000018: CMP r0,#0
0x0000001C: RSBLT r0,r0,#0
0x00000020: CMP r0,#0
0x00000024: MVNLT r0,#0x80000000
0x00000028: BX lr
Finally, the compiler seems to be doing its job, albeit sub-optimally (an unnecessary CMP compared to the assembly version).
I can live with the compilers being sub-optimal, but what bothers me is that they are ruling out something that they shouldn't: 0x80000000
I'd even file a bug report to GCC devs on this, but I found out that Clang also rules out the case of the integer being 0x80000000, and thus I suppose I'm missing something regarding the C standard.
Can anyone tell me where I'm mistaken?
Btw, below is what the if-less bit-hacking version looks like:
int32_t satAbs_bh(int32_t val)
{
int32_t temp = val ^ (val>>31);
val = temp - (val>>31);
val ^= val>>31;
return val;
}
satAbs_bh
0x0000002C: EOR r3,r0,r0,ASR #31
0x00000030: SUB r0,r3,r0,ASR #31
0x00000034: EOR r0,r0,r0,ASR #31
0x00000038: BX lr
Edit: I agree on this question of mine being a duplicate to some degree.
However, it is way more comprehensive, including some assembly-level stuff and bitmask techniques that might be helpful compared to the referred one.
And below comes a workaround for this problem without meddling with compiler options; rule out the possibility of integer overflow preemptively:
int32_t satAbs4(int32_t val)
{
if (val == 0x80000000) return 0x7fffffff;
if (val < 0) val = -val;
return val;
}
satAbs4
0x0000002C: CMP r0,#0x80000000
0x00000030: BEQ {pc}+0x10 ; 0x40
0x00000034: CMP r0,#0
0x00000038: RSBLT r0,r0,#0
0x0000003C: BX lr
0x00000040: MVN r0,#0x80000000
0x00000044: BX lr
Again, the Linaro GCC 7.4.1 I'm using demonstrates its shortcomings: I don't understand the BEQ in the second instruction. A mvneq r0, #0x80000000, as the source code suggests, could have saved two instructions at the end.
Signed integer overflow or underflow is undefined behavior in C, meaning that you are expected to handle these edge cases yourself. In other words, as soon as the compiler has established that a certain signed integer value is positive, it doesn't care if there is a possibility that it could turn negative through UB.
For example, this code:
int test(int input)
{
if (input > 0)
input += 100;
if (input > 0)
input += 100;
if (input > 0)
input += 100;
return input;
}
can legally be optimized to this:
int test(int input)
{
if (input > 0)
input += 300;
return input;
}
even though the author of the initial code might have expected that input could overflow between successive statements.
That's why an optimizing compiler sees your code as something like this:
int32_t satAbs1(int32_t val)
{
if (val < 0) val = -val;
// val must be non-negative here,
// unless you are relying on UB
// the following condition is
// therefore always false:
// if (val < 0) val = 0x7fffffff;
return val;
}
So, the only way to avoid UB is to avoid negating the signed integer if there is a chance that it might invoke UB, i.e.:
int32_t satAbs3_simple(int32_t val)
{
if (val >= 0)
return val;
// we know that val is negative here,
// but unfortunately gcc knows it as well,
// so we'll handle the edge case explicitly
if (val == INT32_MIN)
return INT32_MAX;
return -val;
}
gcc with -O2 produces code with a branch (early conditional return at bxge):
satAbs3_simple:
cmp r0, #0
bxge lr // return r0 if ge #0
cmp r0, #0x80000000
rsbne r0, r0, #0
moveq r0, #0x7FFFFFFF
bx lr
As @rici mentioned in the comments, if exact-width signed int types from stdint.h (intN_t) are available on your compiler, this means they have to be represented with N bits, no padding, using 2's complement.
This means that you can rewrite the code slightly to use bit masks, which might provide a slightly shorter assembly output (at least with gcc 5 or newer), still without branching:
int32_t satAbs3_c(int32_t val)
{
uint32_t result = (uint32_t)val;
if (result & 0x80000000) result = -result; // <-- avoid UB here by negating uint32_t
if (result == 0x80000000) result = 0x7FFFFFFF;
return (int32_t)result;
}
Note that an optimizing compiler should theoretically be able to produce this same output for both cases, but anyway, recent gcc versions (with -O1) for the last snippet give:
satAbs3_c:
cmp r0, #0
rsblt r0, r0, #0
cmp r0, #0x80000000
moveq r0, #0x7FFFFFFF
bx lr
I actually believe it cannot get shorter than this (apart from the xor bit-hacking), because your initial assembly seems to lack a cmp r0, #0 instruction after rsblts (because rsblts changes r0, and cmp is the part where actual comparison takes place).
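For completeness, the xor bit-hacking approach can also be written without relying on signed overflow or on right-shifting a negative value. This is only a sketch under that goal; sat_abs_unsigned is a hypothetical name, and everything is computed in uint32_t.

```c
#include <stdint.h>

/* Branch-free saturating abs, computed entirely in unsigned arithmetic. */
int32_t sat_abs_unsigned(int32_t val)
{
    uint32_t u = (uint32_t)val;
    uint32_t mask = 0U - (u >> 31);   /* all ones if val < 0, else 0 */
    uint32_t r = (u ^ mask) - mask;   /* two's complement negate if negative */
    r ^= 0U - (r >> 31);              /* 0x80000000 becomes 0x7FFFFFFF */
    return (int32_t)r;                /* r <= 0x7FFFFFFF here, so this is exact */
}
```

The only input that reaches the last XOR with its top bit set is INT32_MIN, which is mapped to INT32_MAX.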

Creating Nested If-Statements in ARM Assembly

I am interested in converting a Fibonacci sequence code in C++ into ARM assembly language. The code in C++ is as follows:
#include <iostream>
using namespace std;
int main()
{
int range, first = 0 , second = 1, fibonacci;
cout << "Enter range for the Fibonacci Sequence" << endl;
cin >> range;
for (int i = 0; i < range; i++)
{
if (i <=1)
{
fibonacci = i;
}
else
{
fibonacci = first + second;
first = second;
second = fibonacci;
}
}
cout << fibonacci << endl;
return 0;
}
My attempt at converting this to assembly is as follows:
ldr r0, =0x00000000 ;loads 0 in r0
ldr r1, =0x00000001 ;loads 1 into r1
ldr r2, =0x00000002 ;loads 2 into r2, this will be the equivalent of 'n' in C++ code, but I will force the value of 'n' when writing this code
ldr r3, =0x00000000 ;r3 will be used as a counter in the loop
;r4 will be used as 'fibonacci'
loop:
cmp r3, #2 ;Compares r3 with a value of 0
it lt
movlt r4, r3 ;If r3 is less than #0, r4 will equal r3. This means r4 will only ever be 0 or 1.
it eq ;If r3 is equal to 2, run through these instructions
addeq r4, r0, r1
moveq r0,r1
mov r1, r4
adds r3, r3, #1 ;Increases the counter by one
it gt ;Similarly, if r3 is greater than 2, run though these instructions
addgt r4, r0, r1
movgt r0, r1
mov r1, r4
adds r3, r3, #1
I'm not entirely sure if that is how you do if statements in Assembly, but that will be a secondary concern for me at this point. What I am more interested in, is how I can incorporate an if statement in order to test for the initial condition where the 'counter' is compared to the 'range'. If counter < range, then it should go into the main body of the code where the fibonacci statement will be iterated. It will then continue to loop until counter = range.
I am not sure how to do the following:
cmp r3, r2
;If r3 < r2
{
<code>
}
;else, stop
Also, in order for this to loop correctly, am I able to add:
cmp r3, r2
bne loop
So that the loop iterates until r3 = r2?
Thanks in advance :)
It's not wise to put if-statements inside a loop. Get rid of it.
An optimized(kinda) standalone Fibonacci function should be like this:
unsigned int fib(unsigned int n)
{
unsigned int first = 0;
unsigned int second = 1;
unsigned int temp;
if (n > 47) return 0xffffffff; // overflow check
if (n < 2) return n;
while (1)
{
n -= 1; // steps left before second holds fib(n)
if (n == 0) return second;
temp = first + second;
first = second;
second = temp;
}
}
Much like factorial, optimizing the Fibonacci sequence is somewhat pointless in real-world computing, because the results exceed the 32-bit barrier really soon: the largest input that still fits is 12 for factorial and 47 for Fibonacci.
If you really need them, you are served the best with very short lookup tables.
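As an illustration of how short such a table is, here is a hedged sketch for factorial. The table and wrapper names are invented for this example, and returning 0xFFFFFFFF for out-of-range input is just one possible convention.

```c
#include <stdint.h>

/* 0! .. 12!; 13! no longer fits in 32 bits. */
static const uint32_t factorial_table[13] = {
    1u, 1u, 2u, 6u, 24u, 120u, 720u, 5040u, 40320u,
    362880u, 3628800u, 39916800u, 479001600u
};

/* Hypothetical wrapper: one bounds check and one load, no loop at all. */
uint32_t factorial_lookup(unsigned int n)
{
    return (n <= 12) ? factorial_table[n] : 0xFFFFFFFFu;
}
```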
If you need this function fully implemented for larger values:
https://www.nayuki.io/page/fast-fibonacci-algorithms
Last but not least, here is the function above in assembly:
cmp r0, #47 // r0 is n
movhi r0, #-1 // overflow check
bxhi lr
cmp r0, #2
bxlo lr
mov r2, r0 // r2 is the counter now
mov r1, #0 // r1 is first
mov r0, #1 // r0 is second
loop:
subs r2, r2, #1 // n -= 1
add r12, r0, r1 // temp = first + second
mov r1, r0 // first = second
bxeq lr // return second when condition is met
mov r0, r12 // second = temp
b loop
Please note that the last bxeq lr can be placed immediately after subs which might seem more logical, but with the multiple issuing capability of the Cortex series in mind, it's better in this order.
It might be not exactly the answer you were looking for, but keep this in mind: A single if statement inside a loop can seriously cripple the performance - a nested one even more.
And there are almost always ways avoiding these. You just have to look for them.
Conditionals compile to conditional jumps in almost all assembly languages:
if (condition)
..iftrue..
else
..iffalse..
becomes
eval condition
conditional_jump_if_true truelabel
..iffalse..
unconditional_jump endlabel
truelabel:
..iftrue..
endlabel:
or the other way around (exchange false and true).
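The same shape can be written in C with goto, which may make the mapping easier to follow. This is a minimal sketch; the function and the condition x < 0 are arbitrary choices for illustration.

```c
/* if (x < 0) y = -1; else y = 1;  written with explicit jumps */
int sign_with_gotos(int x)
{
    int y;
    if (x < 0) goto truelabel;   /* eval condition + conditional jump */
    y = 1;                       /* ..iffalse.. */
    goto endlabel;               /* unconditional jump over the true part */
truelabel:
    y = -1;                      /* ..iftrue.. */
endlabel:
    return y;
}
```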
ARM supports conditional execution to eliminate these jumps when compiling the innermost conditionals: http://www.davespace.co.uk/arm/introduction-to-arm/conditional.html
IT... is a Thumb-2 instruction: http://en.wikipedia.org/wiki/ARM_architecture#Thumb-2 to support unified assemblies. See http://www.keil.com/support/man/docs/armasm/armasm_BABJGFDD.htm for more details.
Your code for looping (cmp and bne) is fine.
In general, try to rewrite your code using gotos in place of loops and else branches.
An else can remain only at the deepest nesting level.
Then you can convert this semi-assembly code to assembly much more easily.
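As a sketch of that intermediate step, here is a counted Fibonacci loop rewritten with gotos only. fib_gotos is a hypothetical name; the comments show the kind of instruction each line would roughly turn into, and the result wraps for n above 47 because no overflow check is included.

```c
/* Goto-only rewrite of a counted loop: returns the n-th Fibonacci number. */
unsigned int fib_gotos(unsigned int n)
{
    unsigned int first = 0, second = 1, temp;
    unsigned int i = 0;
loop:
    if (i == n) goto done;   /* cmp i, n / beq done */
    temp = first + second;   /* add */
    first = second;          /* mov */
    second = temp;           /* mov */
    i = i + 1;               /* add i, i, #1 */
    goto loop;               /* b loop */
done:
    return first;            /* after n steps, first holds F(n) */
}
```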
HTH

Resources