I'm currently writing a lecture on ARM optimization, specifically on vector machines such as NEON as the final target.
And since vector machines don't fare well with if-else slaloms, I'm trying to demonstrate how to get rid of them by bit-hacking.
I picked the "saturating absolute" function as an example for this. It's practically an ABS routine with the added functionality of capping the result at 0x7fffffff.
The most negative 32-bit number is 0x80000000, and it is dangerous because val = -val; returns the same 0x80000000 as the initial value. This is caused by the asymmetry of the two's complement system and matters especially in DSP operations, so it has to be filtered out, mostly by "saturating".
int32_t satAbs1(int32_t val)
{
if (val < 0) val = -val;
if (val < 0) val = 0x7fffffff;
return val;
}
Below is what I would write in assembly:
cmp r0, #0
rsblts r0, r0, #0
mvnlt r0, #0x80000000
bx lr
And below is what I actually get for the C code above:
satAbs1
0x00000000: CMP r0,#0
0x00000004: RSBLT r0,r0,#0
0x00000008: BX lr
WTH? The compiler simply discarded the saturating part altogether!
The compiler seems to rule out val being negative after the first if statement, which isn't true if it was 0x80000000.
Or maybe the function should return an unsigned value?
uint32_t satAbs2(int32_t val)
{
uint32_t result;
if (val < 0) result = (uint32_t) -val; else result = (uint32_t) val;
if (result == 0x80000000) result = 0x7fffffff;
return result;
}
satAbs2
0x0000000C: CMP r0,#0
0x00000010: RSBLT r0,r0,#0
0x00000014: BX lr
Unfortunately, it generates the exact same machine code as the signed version: no saturation.
Again, the compiler seems to rule out the case of val being 0x80000000.
Ok, let's widen the range of the second if statement:
uint32_t satAbs3(int32_t val)
{
uint32_t result;
if (val < 0) result = (uint32_t) -val; else result = (uint32_t) val;
if (result >= 0x80000000) result = 0x7fffffff;
return result;
}
satAbs3
0x00000018: CMP r0,#0
0x0000001C: RSBLT r0,r0,#0
0x00000020: CMP r0,#0
0x00000024: MVNLT r0,#0x80000000
0x00000028: BX lr
Finally, the compiler seems to be doing its job, albeit sub-optimally (an unnecessary CMP compared to the assembly version).
I can live with compilers being sub-optimal, but what bothers me is that they are ruling out something that they shouldn't: 0x80000000.
I'd even file a bug report with the GCC devs about this, but I found out that Clang also rules out the case of the integer being 0x80000000, so I suppose I'm missing something regarding the C standard.
Can anyone tell me where I'm mistaken?
Btw, below is what the if-less bit-hacking version looks like:
int32_t satAbs_bh(int32_t val)
{
int32_t temp = val ^ (val>>31);
val = temp + ((uint32_t)val>>31);
val ^= val>>31;
return val;
}
satAbs_bh
0x0000002C: EOR r3,r0,r0,ASR #31
0x00000030: ADD r0,r3,r0,LSR #31
0x00000034: EOR r0,r0,r0,ASR #31
0x00000038: BX lr
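The same idea can be sketched with unsigned arithmetic only, so that neither the right shift of a negative value nor the wrap-around at 0x80000000 depends on signed-integer behavior. The name sat_abs_branchless is made up for this illustration, and the final cast is value-preserving because the result never exceeds 0x7FFFFFFF.

```c
#include <stdint.h>

/* Hypothetical name; a branch-free saturating absolute value.
   All arithmetic happens on uint32_t, where wrap-around is well defined. */
int32_t sat_abs_branchless(int32_t val)
{
    uint32_t u    = (uint32_t)val;
    uint32_t mask = 0u - (u >> 31);     /* 0xFFFFFFFF if val is negative, else 0 */
    uint32_t r    = (u ^ mask) - mask;  /* two's complement negate when negative */

    /* Only the input 0x80000000 still has bit 31 set at this point;
       flipping all bits turns it into 0x7FFFFFFF. */
    r ^= 0u - (r >> 31);
    return (int32_t)r;
}
```

Whether a given compiler turns this into the three-instruction sequence above is something to check in the disassembly rather than assume.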
Edit: I agree that this question of mine is a duplicate to some degree.
However, it is far more comprehensive than the referred one, including some assembly-level details and bitmask techniques that might be helpful.
And below is a workaround for this problem that does not touch the compiler options: rule out the possibility of integer overflow preemptively.
int32_t satAbs4(int32_t val)
{
if (val == 0x80000000) return 0x7fffffff;
if (val < 0) val = -val;
return val;
}
satAbs4
0x0000002C: CMP r0,#0x80000000
0x00000030: BEQ {pc}+0x10 ; 0x40
0x00000034: CMP r0,#0
0x00000038: RSBLT r0,r0,#0
0x0000003C: BX lr
0x00000040: MVN r0,#0x80000000
0x00000044: BX lr
Again, the Linaro GCC 7.4.1 I'm using demonstrates its shortcomings: I don't understand the BEQ in the second instruction. A conditional mvneq r0, #0x80000000 in its place, which is what the source code suggests, could have saved two instructions at the end.
Signed integer overflow or underflow is undefined behavior in C, meaning that you are expected to handle these edge cases yourself. In other words, as soon as the compiler has established that a certain signed integer value is positive, it doesn't care if there is a possibility that it could turn negative through UB.
For example, this code:
int test(int input)
{
if (input > 0)
input += 100;
if (input > 0)
input += 100;
if (input > 0)
input += 100;
return input;
}
can legally be optimized to this:
int test(int input)
{
if (input > 0)
input += 300;
return input;
}
even though the author of the initial code might have expected that input could overflow between successive statements.
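If overflow between the statements is a real concern, the check has to happen before the addition, while the value is still in range. A minimal sketch for a single step (the helper name add100_checked is hypothetical):

```c
#include <limits.h>

/* Hypothetical helper: adds 100 only when the result is representable,
   so the signed addition can never overflow. */
int add100_checked(int input)
{
    if (input > 0 && input <= INT_MAX - 100)
        input += 100;
    return input;
}
```

Because the comparison is against INT_MAX - 100, no expression in this function overflows and the compiler has nothing to assume away.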
That's why an optimizing compiler sees your code as something like this:
int32_t satAbs1(int32_t val)
{
if (val < 0) val = -val;
// val must be non-negative here,
// unless you are relying on UB
// the following condition is
// therefore always false:
// if (val < 0) val = 0x7fffffff;
return val;
}
So, the only way to avoid UB is to avoid negating the signed integer if there is a chance that it might invoke UB, i.e.:
int32_t satAbs3_simple(int32_t val)
{
if (val >= 0)
return val;
// we know that val is negative here,
// but unfortunately gcc knows it as well,
// so we'll handle the edge case explicitly
if (val == INT32_MIN)
return INT32_MAX;
return -val;
}
gcc with -O2 produces code with a branch (early conditional return at bxge):
satAbs3_simple:
cmp r0, #0
bxge lr // return r0 if ge #0
cmp r0, #0x80000000
rsbne r0, r0, #0
moveq r0, #0x7FFFFFFF
bx lr
As @rici mentioned in the comments, if the exact-width signed integer types from stdint.h (intN_t) are available on your compiler, they have to be represented with exactly N bits, no padding, in two's complement.
This means that you can rewrite the code slightly to use bit masks, which might produce slightly shorter assembly output (at least with gcc 5 or newer), still without branching:
int32_t satAbs3_c(int32_t val)
{
uint32_t result = (uint32_t)val;
if (result & 0x80000000) result = -result; // <-- avoid UB here by negating uint32_t
if (result == 0x80000000) result = 0x7FFFFFFF;
return (int32_t)result;
}
Note that an optimizing compiler should theoretically be able to produce this same output for both cases, but anyway, recent gcc versions (with -O1) for the last snippet give:
satAbs3_c:
cmp r0, #0
rsblt r0, r0, #0
cmp r0, #0x80000000
moveq r0, #0x7FFFFFFF
bx lr
I actually believe it cannot get shorter than this (apart from the xor bit-hacking), because your initial assembly seems to lack a cmp r0, #0 instruction after rsblts (because rsblts changes r0, and cmp is the part where actual comparison takes place).
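Another way to stay clear of UB, shown here only as a sketch, is to do the negation in a wider type where -INT32_MIN is representable. The name sat_abs_wide is made up, and on a 32-bit core the 64-bit arithmetic will probably cost more instructions than the versions above.

```c
#include <stdint.h>

/* Hypothetical name; saturating absolute value via a 64-bit intermediate. */
int32_t sat_abs_wide(int32_t val)
{
    int64_t wide = val;                      /* every int32_t fits in int64_t */
    if (wide < 0) wide = -wide;              /* at most 2^31, so no overflow */
    if (wide > INT32_MAX) wide = INT32_MAX;  /* only reached for INT32_MIN */
    return (int32_t)wide;
}
```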
Related
I have to write a program in ARM assembly that finds the smallest element in an array. Normally this is a pretty easy thing to do in any programming language, but I just can't get my head around what I'm doing wrong in ARM assembly. I'm a beginner in ARM, but I know my way around C. So I wrote the algorithm for finding the smallest number in an array in C like this:
int minarray = arr[0];
for (int i = 0; i < len; i++){
if (arr[i] < minarray){
minarray = arr[i];
}
}
It's easy and nothing special really.
Now I tried to carry the algorithm over to ARM almost unchanged. Two things are already given from the beginning: the address of the first element is stored in register r0, and the length of the array is stored in register r1. In the end, the smallest element must be stored back in register r0. Here is what I did:
This is almost the same algorithm as the one in C. First I load the first element into a new register, r4. Now the first element is the smallest. Then, once again, I load the first element into r8. I compare those two; if r8 <= r4, I copy the content of r8 to r4. After that (because I'm working with 32-bit numbers) I add 4 bytes to r0 to move on to the next element of the array. Then I subtract 1 from the array length to loop through the array until it's below 0, which stops the program.
The feedback I'm getting from the testing function that was given to us to check our program says that it works partly. It says that it works for short arrays and arrays of length 0, but not for long arrays. I'm honestly lost. I think I'm making a really dumb mistake, but I just cannot find it. I've been stuck on this easy problem for 3 days now, and everything I have tried either did not work or, as I said, only worked "partly". I would really appreciate it if someone could help me out.
This is the feedback that i get:
✗ min works with other numbers
✗ min works with a long array
✓ min works with a short array
✓ min tolerates size = 0
(x is for "it does not work", ✓ is for "it works")
So you see what I'm saying? I just do not understand how to make it work with a longer array.
I'm not very good at ARM assembly, but to my understanding R4 is expected to keep the minimum value, and R8 is used to keep the most recently fetched value from the input array.
The minimum is updated with this instruction:
MOVLE r8, r4
But it actually updated R8, not R4.
Try:
MOVLE r4, r8
EDIT
Another issue is the use of an incorrect branch instruction:
SUBS r1, r1, #1
BPL loop1
works like:
r1 = r1 - 1
if (r1 >= 0) goto loop1;
For R1 equal to 1, the loop is executed twice:
r1 = 1
... do stuff
r1 = r1 - 1 // r1 is 0 now
if (r1 >= 0) goto loop1; // 0>=0 TRUE!
... do stuff, overflow the input by indexing at `[r0 + 4]`
r1 = r1 - 1 // r1 is -1
if (r1 >= 0) goto loop1; // -1 >= 0 FALSE
// exit function
To fix it, branch only while the counter is non-zero:
BNE loop1
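The off-by-one can be modeled in C by counting how often the loop body runs under each exit condition. This is only a sketch: both function names are made up, and both assume len is at least 1, since the loop body always runs once before the counter is tested.

```c
/* Models "SUBS r1, r1, #1 / BPL loop1": repeat while the result is >= 0. */
int iterations_with_bpl(int len)
{
    int count = 0;
    do {
        count++;          /* the loop body would run here */
        len = len - 1;
    } while (len >= 0);
    return count;
}

/* Models "SUBS r1, r1, #1 / BNE loop1": repeat while the result is != 0. */
int iterations_with_bne(int len)
{
    int count = 0;
    do {
        count++;          /* the loop body would run here */
        len = len - 1;
    } while (len != 0);
    return count;
}
```

For len = 1 the BPL form runs the body twice and the BNE form once, which matches the trace above.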
When coding in C, use the correct types.
You also do not have to start iterating from index 0; starting from index 1 is enough:
int foo(const int *arr, size_t len)
{
int minarray = arr[0];
for (size_t i = 1; i < len; i++)
{
if (arr[i] < minarray)
{
minarray = arr[i];
}
}
return minarray;
}
And it generates this code:
foo:
mov r3, r0
subs r1, r1, #1
ldr r0, [r3], #4
beq .L1
.L3:
ldr r2, [r3], #4
cmp r0, r2
it ge
movge r0, r2
subs r1, r1, #1
bne .L3
.L1:
bx lr
I have some C code in mind that I want to implement in ARM assembly.
The C code I have in my mind is something of this sort:
int a;
scanf("%d",&a);
if(a == 0 || a == 1){
a = 1;
}
else{
a = 2;
}
What I have tried:
//arm equivalent of taking input to reg r0
//check for first condition
cmp r0,#1
moveq r0,#1
//if false
movne r0,#2
//check for second condition
cmp r0,#0
moveq r0,#1
Is this the correct way of implementing it?
Your code is broken for a=0 - single step through it in your head, or in a debugger, to see what happens.
Given this specific condition, it's equivalent to (unsigned)a <= 1U (because negative integers convert to huge unsigned values). You can do a single cmp and movls / movhi. Compilers already spot this optimization; here's how to ask a compiler to make asm for you so you can learn the tricks clever humans programmed into them:
int foo(int a) {
if(a == 0 || a == 1){
a = 1;
}
else{
a = 2;
}
return a;
}
With ARM GCC10 -O3 -marm on the Godbolt compiler explorer:
foo:
cmp r0, #1
movls r0, #1
movhi r0, #2
bx lr
See How to remove "noise" from GCC/clang assembly output? for more about making functions that will have useful asm output. In this case, r0 is the first arg-passing register in the calling convention, and also the return-value register.
I also included another C version using if (a <= 1U) to show that it compiles to the same asm. (1U is an unsigned constant, so C's usual arithmetic conversions implicitly convert a to unsigned so the types match for the <= operator. You don't need to explicitly do (unsigned)a <= 1U.)
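The equivalence is easy to check in plain C. In this sketch the two function names are invented for the comparison; the second one relies on negative values converting to large unsigned values.

```c
/* Hypothetical names; both functions answer "is a equal to 0 or 1?". */
int is_zero_or_one(int a)
{
    return a == 0 || a == 1;
}

int is_zero_or_one_unsigned(int a)
{
    /* Negative a converts to a value far above 1, so one unsigned
       comparison covers both the lower and the upper bound. */
    return (unsigned)a <= 1U;
}
```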
General case: not a single range
For a case like a==0 || a==3 that isn't a single range-check, you can predicate a 2nd cmp. (Godbolt)
foo:
cmp r0, #3 @ sets Z if a was 3
cmpne r0, #0 @ leaves Z unmodified if it was already set, else sets it according to a == 0
moveq r0, #1
movne r0, #2
bx lr
You can similarly chain && like a==3 && b==4, or for checks like a >= 3 && a <= 7 you can sub / cmp, using the same unsigned-compare trick as the 0 or 1 range check after sub maps a values into the 0..n range. See the Godbolt link for that.
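The sub / cmp range check can be written out in C as a sketch. The function name is made up, and the subtraction is done in unsigned arithmetic so it cannot overflow for very negative a.

```c
/* Hypothetical name; equivalent to (a >= 3 && a <= 7). */
int in_range_3_to_7(int a)
{
    /* Subtracting the lower bound maps 3..7 onto 0..4; anything below 3
       wraps around to a huge unsigned value and fails the comparison. */
    return (unsigned)a - 3u <= 4u;
}
```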
No, that does not work.
cmp r0,#1 // is it a one?
moveq r0,#1 // yes: make it a one again?
movne r0,#2 // otherwise make it a 2; if it was a zero to start, it is now a 2
cmp r0,#0 // at this point it is either 1 or 2 because you forced it, so it cannot be zero; what it started as is now lost
moveq r0,#1
You have the right concept but need to order things better.
Following that line of thinking, though, maybe use another register:
x = 2;
if(a==0) x = 1;
if(a==1) x = 1;
a = x;
Ponder this:
if(a==0) a = 1;
if(a!=1) a = 2;
Or, as everyone else is going to say, ask the compiler.
Because of the OR (test OR test), the two tests generally have to be done separately: the first test being false does not mean the else branch applies, so you have to do the other test before declaring the condition false. But if the first test is true, you need to hop over everything and not fall into the second test, because that might (in this case, will) be false.
As Peter points out you can use unsigned less than or equal and greater than conditions (even though in C it is a signed int, bits is bits).
LS Unsigned lower or same
HI Unsigned higher
Depending on the ARM instruction set, it can be:
cmp r0, #1
movls r0, #1
movhi r0, #2
bx lr
or
cmp r0, #1
ite ls
movls r0, #1
movhi r0, #2
bx lr
Am I smarter than you? No, I simply used the compiler to compile the C code.
https://godbolt.org/z/dqxv64Eb9
Why are there two separate instructions instead of one? In practice, in what kind of situations do we need to use the CMP and TEQ instructions?
I know how both instructions work.
In short: the two serve different purposes. cmp is subs without a destination, while teq is eors without a destination.
cmp is very straightforward: you compare two numbers A and B
signed:
gt: A > B
ge: A >= B
eq: A == B
le: A <= B
lt: A < B
unsigned:
hi: A > B
hs: A >= B
eq: A == B
ls: A <= B
lo: A < B
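The split between the signed and unsigned condition codes mirrors what C does based on the operand types. As a small sketch (function names invented here): the bit pattern 0xFFFFFFFF is less than 1 when treated as signed and greater than 1 when treated as unsigned.

```c
#include <stdint.h>

/* Hypothetical names; same comparison operator, different operand types. */
int less_signed(int32_t a, int32_t b)
{
    return a < b;    /* corresponds to the lt condition after cmp */
}

int less_unsigned(uint32_t a, uint32_t b)
{
    return a < b;    /* corresponds to the lo condition after cmp */
}
```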
Let's assume the problem below though:
int32_t foo(int32_t A)
{
if (((A < 0) && ((A & 1) == 1)) || ((A >= 0) && ((A & 1) == 0)))
{
A += 1;
}
else
{
A -= 1;
}
return A;
}
In human language, the if statement is true if A is either an odd negative number or an even non-negative number, and Linaro GCC 7.4.1 at -O3 generates the mess below:
foo
0x00000000: CMP r0,#0
0x00000004: AND r3,r0,#1
0x00000008: BLT {pc}+0x14 ; 0x1c
0x0000000C: CMP r3,#0
0x00000010: BEQ {pc}+0x14 ; 0x24
0x00000014: SUB r0,r0,#1
0x00000018: BX lr
0x0000001C: CMP r3,#0
0x00000020: BEQ {pc}-0xc ; 0x14
0x00000024: ADD r0,r0,#1
0x00000028: BX lr
People knowledgeable in the field of bit hacking would alter the if statement like below:
int32_t bar(int32_t A)
{
if ((A ^ (A<<31)) >= 0)
{
A += 1;
}
else
{
A -= 1;
}
return A;
}
And the results are:
bar
0x0000002C: EORS r3,r0,r0,LSL #31
0x00000030: ADDPL r0,r0,#1
0x00000034: SUBMI r0,r0,#1
0x00000038: BX lr
And finally, assembly programmers will replace EORS with teq r0, r0, lsl #31.
It won't make the code any faster, but it doesn't need R3 as the scratch register.
Note that the code above is just a showcase, being a separate function where you have an excess of available registers.
In real life however, registers are by far the most scarce resource, especially inside a loop, and even compilers will make use of the teq instruction in similar situations.
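Strictly speaking, left-shifting a signed value into the sign bit is not defined by the C standard, so a more careful variant of the trick does the shift and XOR on uint32_t. The sketch below uses made-up names and keeps the straightforward condition next to it for comparison; like the originals, neither function guards the final add or subtract against overflow at the extremes.

```c
#include <stdint.h>

/* Hypothetical name; the straightforward condition from the text. */
int32_t step_reference(int32_t A)
{
    if ((A < 0 && (A & 1) == 1) || (A >= 0 && (A & 1) == 0))
        return A + 1;
    return A - 1;
}

/* Hypothetical name; the same decision via one XOR on unsigned values. */
int32_t step_bithack(int32_t A)
{
    uint32_t u = (uint32_t)A;
    /* Bit 31 of (u ^ (u << 31)) is the sign bit XOR the lowest bit. */
    if (((u ^ (u << 31)) >> 31) == 0)
        return A + 1;
    return A - 1;
}
```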
Summing it up, there are fields such as error correction, decryption/encryption, etc. where tons of xor operations are done, and people dealing with those problems know how to appreciate instructions such as teq and when to use them.
And always remember: never trust compilers.
I have the following issue: I have two 64-bit variables that have to be compared as quickly as possible, and my microcontroller is only 32-bit.
My thought is that it is necessary to split each 64-bit variable into two 32-bit variables, like this
uint64_t var = 0xAAFFFFFFABCDELL;
hiPart = (uint32_t)((var & 0xFFFFFFFF00000000LL) >> 32);
loPart = (uint32_t)(var & 0xFFFFFFFFLL);
and then to compare the hiParts and loParts, but I am sure that this approach is slow and that there is a much better solution.
The first rule should be: write your program so that it is readable to a human.
When in doubt, don't assume anything, but measure it. Let's see what Godbolt gives us.
#include <stdint.h>
#include <stdbool.h>
bool foo(uint64_t a, uint64_t b) {
return a == b;
}
bool foo2(uint64_t a, uint64_t b) {
uint32_t ahiPart = (uint32_t)((a & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t aloPart = (uint32_t)(a & 0xFFFFFFFFULL);
uint32_t bhiPart = (uint32_t)((b & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t bloPart = (uint32_t)(b & 0xFFFFFFFFULL);
return ahiPart == bhiPart && aloPart == bloPart;
}
foo:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
foo2:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
As you can see, both result in the exact same assembly code, so you decide: which one is less error-prone and easier to read?
There was a time, some years ago, when you needed tricks to be smarter than the compiler. But in 99.999% of cases the compiler will be smarter than you.
And your variables are unsigned. So use ULL instead of LL.
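For reading the listing above, it may help to see what the eor / eor / orr sequence computes, written back in C. This is only a model of the generated code with a made-up function name, not something you would need to write by hand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; mirrors the eor/eor/orr pattern in the listing. */
bool equal64_xor(uint64_t a, uint64_t b)
{
    uint32_t hi_diff = (uint32_t)(a >> 32) ^ (uint32_t)(b >> 32);  /* eor on the upper words */
    uint32_t lo_diff = (uint32_t)a ^ (uint32_t)b;                  /* eor on the lower words */

    /* Any differing bit in either half survives the OR, so the values
       are equal exactly when the combined word is zero. */
    return (hi_diff | lo_diff) == 0;
}
```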
The fastest way is to let the compiler do it. Most compilers are much better than humans at micro-optimization.
uint64_t var = …, other_var = …;
if (var == other_var) …
There aren't many ways to go about it. Under the hood, the compiler will arrange to load the upper 32 bits and the lower 32 bits of each variable into registers, and compare the two registers that contain the upper 32 bits and the two registers that contain the lower 32 bits. The assembly code might look something like this:
load 32 bits from &var into r0
load 32 bits from &other_var into r1
if r0 != r1: goto different
load 32 bits from &var + 4 into r2
load 32 bits from &other_var + 4 into r3
if r2 != r3: goto different
// code for if-equal
different:
// code for if-not-equal
Here are some things the compiler knows better than you:
Which registers to use, based on the needs of the surrounding code.
Whether to reuse the same registers to compare the upper and lower parts, or to use different registers.
Whether to process one part and then the other (as above), or to load one variable then the other. The best order depends on the pressure on registers and on the memory access times and pipelining of the particular processor model.
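The same word-by-word scheme extends to ordering comparisons. As a sketch of what happens under the hood (the function name is hypothetical, and in real code you would simply write a < b and let the compiler do this):

```c
#include <stdint.h>

/* Hypothetical helper; returns -1, 0 or 1 like a three-way compare. */
int compare64_by_halves(uint64_t a, uint64_t b)
{
    uint32_t a_hi = (uint32_t)(a >> 32), b_hi = (uint32_t)(b >> 32);
    uint32_t a_lo = (uint32_t)a,         b_lo = (uint32_t)b;

    if (a_hi != b_hi)                  /* upper words decide whenever they differ */
        return a_hi < b_hi ? -1 : 1;
    if (a_lo != b_lo)                  /* otherwise the lower words decide */
        return a_lo < b_lo ? -1 : 1;
    return 0;
}
```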
If you work with a union, you can compare the hi and lo parts without any extra calculations (the member order below assumes a little-endian target):
typedef union
{
struct
{
uint32_t loPart;
uint32_t hiPart;
};
uint64_t complete;
}uint64T;
uint64T var = { .complete = 0xAAFFFFFFABCDEULL };
I am interested in converting a Fibonacci sequence code in C++ into ARM assembly language. The code in C++ is as follows:
#include <iostream>
using namespace std;
int main()
{
int range, first = 0 , second = 1, fibonacci;
cout << "Enter range for the Fibonacci Sequence" << endl;
cin >> range;
for (int i = 0; i < range; i++)
{
if (i <=1)
{
fibonacci = i;
}
else
{
fibonacci = first + second;
first = second;
second = fibonacci;
}
}
cout << fibonacci << endl;
return 0;
}
My attempt at converting this to assembly is as follows:
ldr r0, =0x00000000 ;loads 0 in r0
ldr r1, =0x00000001 ;loads 1 into r1
ldr r2, =0x00000002 ;loads 2 into r2, this will be the equivalent of 'n' in C++ code, but I will force the value of 'n' when writing this code
ldr r3, =0x00000000 ;r3 will be used as a counter in the loop
;r4 will be used as 'fibonacci'
loop:
cmp r3, #2 ;Compares r3 with a value of 0
it lt
movlt r4, r3 ;If r3 is less than #0, r4 will equal r3. This means r4 will only ever be 0 or 1.
it eq ;If r3 is equal to 2, run through these instructions
addeq r4, r0, r1
moveq r0,r1
mov r1, r4
adds r3, r3, #1 ;Increases the counter by one
it gt ;Similarly, if r3 is greater than 2, run though these instructions
addgt r4, r0, r1
movgt r0, r1
mov r1, r4
adds r3, r3, #1
I'm not entirely sure if that is how you do if statements in Assembly, but that will be a secondary concern for me at this point. What I am more interested in, is how I can incorporate an if statement in order to test for the initial condition where the 'counter' is compared to the 'range'. If counter < range, then it should go into the main body of the code where the fibonacci statement will be iterated. It will then continue to loop until counter = range.
I am not sure how to do the following:
cmp r3, r2
;If r3 < r2
{
<code>
}
;else, stop
Also, in order for this to loop correctly, am I able to add:
cmp r3, r2
bne loop
So that the loop iterates until r3 = r2?
Thanks in advance :)
It's not wise to put if statements inside a loop. Get rid of them.
An optimized (kind of) standalone Fibonacci function should look like this:
unsigned int fib(unsigned int n)
{
unsigned int first = 0;
unsigned int second = 1;
unsigned int temp;
if (n > 47) return 0xffffffff; // overflow check
if (n < 2) return n;
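// first/second start as F(0)/F(1); the loop performs n-1 updates
// before it returns, which leaves F(n) in second
while (1)
{
n -= 1;
if (n == 0) return second;
temp = first + second;
first = second;
second = temp;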
}
}
Much like factorial, optimizing Fibonacci sequence is somewhat nonsense in real world computing, because they exceed the 32-bit barrier really soon: It's 12 with factorial and 47 with Fibonacci.
If you really need them, you are served the best with very short lookup tables.
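A sketch of such a table, assuming a 32-bit result and the same 0xffffffff overflow marker as above. The name fib_lookup is made up, and the table is filled on the first call instead of being typed in by hand, which makes this version not thread-safe as written.

```c
#include <stdint.h>

/* Hypothetical name; F(0)..F(47) are all that fit into 32 bits. */
uint32_t fib_lookup(uint32_t n)
{
    static uint32_t table[48];
    static int ready = 0;

    if (!ready) {                       /* fill the table once */
        table[0] = 0;
        table[1] = 1;
        for (int i = 2; i < 48; i++)
            table[i] = table[i - 1] + table[i - 2];
        ready = 1;
    }
    if (n > 47) return 0xffffffffu;     /* overflow marker */
    return table[n];
}
```

In a real project the 48 values would more likely live in a const array so they can sit in flash.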
If you need this function fully implemented for larger values:
https://www.nayuki.io/page/fast-fibonacci-algorithms
Last but not least, here is the function above in assembly:
cmp r0, #47 // r0 is n
movhi r0, #-1 // overflow check
bxhi lr
cmp r0, #2
bxlo lr
mov r2, r0 // r2 is the counter now; the loop runs n-1 updates before returning
mov r1, #0 // r1 is first
mov r0, #1 // r0 is second
loop:
subs r2, r2, #1 // n -= 1
add r12, r0, r1 // temp = first + second
mov r1, r0 // first = second
bxeq lr // return second when condition is met
mov r0, r12 // second = temp
b loop
Please note that the last bxeq lr can be placed immediately after subs which might seem more logical, but with the multiple issuing capability of the Cortex series in mind, it's better in this order.
It might not be exactly the answer you were looking for, but keep this in mind: a single if statement inside a loop can seriously cripple the performance, and a nested one even more.
And there are almost always ways of avoiding these. You just have to look for them.
Conditionals compile to conditional jumps in almost all assembly languages:
if (condition)
..iftrue..
else
..iffalse..
becomes
eval condition
conditional_jump_if_true truelabel
..iffalse..
unconditional_jump endlabel
truelabel:
..iftrue..
endlabel:
or the other way around (exchange false and true).
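The same layout can be spelled out in C with gotos, which makes the mapping to labels and jumps visible. This is just an illustration with an invented function and C labels named after the pseudo-assembly above.

```c
/* Hypothetical example: |a - b| written in the jump layout shown above. */
int abs_diff_goto(int a, int b)
{
    int result;

    if (a > b) goto truelabel;   /* eval condition + conditional jump */
    result = b - a;              /* ..iffalse.. */
    goto endlabel;               /* unconditional jump */
truelabel:
    result = a - b;              /* ..iftrue.. */
endlabel:
    return result;
}
```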
ARM supports conditional execution to eliminate these jumps when compiling the innermost conditionals: http://www.davespace.co.uk/arm/introduction-to-arm/conditional.html
IT... is a Thumb-2 instruction: http://en.wikipedia.org/wiki/ARM_architecture#Thumb-2 to support unified assemblies. See http://www.keil.com/support/man/docs/armasm/armasm_BABJGFDD.htm for more details.
Your code for looping (cmp and bne) is fine.
In general, try to rewrite your code using gotos instead of loops and else parts.
else can remain only at the deepest nesting level.
Then you can convert this semi-assembly code to assembly much more easily.
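As a sketch of that intermediate step, here is a counting loop with the "counter < range" test at the top, written with gotos only. The function and its labels are made up for the example, and the comments name the instructions each line would roughly turn into.

```c
/* Hypothetical example: sums 0 + 1 + ... + (n-1) using only gotos. */
unsigned sum_below_goto(unsigned n)
{
    unsigned i = 0;
    unsigned sum = 0;

loop:
    if (i >= n) goto done;   /* cmp i, n  then a conditional branch to done */
    sum += i;                /* loop body */
    i += 1;                  /* add i, i, #1 */
    goto loop;               /* b loop */
done:
    return sum;
}
```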
HTH