Random number with modulo ARM - arm

I'm coding a guess-the-number game in ARM assembly to learn more about it.
But I'm stuck at the random number generation...
I call rand() but I don't know how to do the modulo 100 to generate a number in the range 0-99.
I don't think there is a modulo instruction in ARM, and "and r0, r0, #100" is not very random...
Here is the start of the pseudo-random number generation:
mov r0, #0
bl time
bl srand
bl rand

AND only works when the remainder you want comes from division by a power of 2 (for example, and r0, r0, #127 gives r0 % 128). One thing you could do is use such a value instead of 100.
An alternative is to compute the remainder from the quotient:
a % 100 = a - (100 * int(a/100))
This is what gcc does as well (though it avoids actually dividing by 100, using something called reciprocal multiplication; if your ARM supports integer divide instructions, though, you're in luck).
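As a rough illustration of that identity, here is a minimal C sketch. The function name mod100 is made up for this example, and it assumes a non-negative input, which is what rand() returns:

```c
#include <stdint.h>

/* Hypothetical helper: remainder of a divided by 100 without using %. */
uint32_t mod100(uint32_t a)
{
    uint32_t q = a / 100;   /* integer quotient; a compiler may emit a
                               reciprocal multiply or a divide instruction */
    return a - 100 * q;     /* a % 100 == a - 100 * (a / 100) */
}
```

In assembly this maps to one division (or a call to a division helper), one multiply and one subtract; on cores that have it, a multiply-and-subtract instruction can fuse the last two steps.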

Related

How can I use arm adcs in loop?

In ARM assembly language, the ADCS instruction adds with the carry flag C and sets the condition flags.
The CMP instruction also sets the condition flags, so the carry produced by ADCS is overwritten.
How can I solve this?
This is my code; it implements a BCD adder for r0 and r1:
ldr r8, =#0
ldr r9, =#15
adds r7, r8, #0
ADDLOOP:
and r4, r0, r9
and r5, r1, r9
adcs r6, r4, r5
orr r7, r6, r7
add r8, r8, #1
mov r9, r9, lsl #4
cmp r8, #3
bgt ADDEND
bl ADDLOOP
ADDEND:
mov r0, r7
I tried to save the state of the condition flags, but I don't know how to do it.
To save/restore the Carry flag, you could create a 0/1 integer in a register (perhaps with adc reg, zeroed_reg, #0?), then next iteration cmp reg, #1 or rsbs reg, reg, #1 to set the carry flag from it.
ARM can't materialize C as an integer 0/1 with a single instruction without any setup; compilers normally use movcs r0, #1 / movcc r0, #0 when not in a loop (Godbolt), but in a loop you'd probably want to zero a register once outside the loop instead of using two instructions predicated on carry-set / carry-clear.
Loop without modifying C
Use teq r8, #4 / bne ADDLOOP as the loop branch, like the bottom of a do{}while(r8 != 4).
Or count down from 4 with tst r8,r8 / bne ADDLOOP, using sub r8, #1 instead of add.
TEQ updates N and Z but not V, and it leaves C alone unless you use a shifted source operand (docs). Unlike cmp, which sets flags like subs, it sets flags like eors. The eq / ne conditions work the same: subtraction and XOR both produce zero when the inputs are equal, and non-zero in every other case. Greater / less conditions wouldn't be meaningful after teq anyway.
This is what optimized BigInt code like GMP does, for example in its mpn_add_n function (source) which adds two bigint inputs (arrays of 32-bit chunks).
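To make the carry chain concrete, here is a C sketch of that kind of multi-word addition, with the carry kept as a 0/1 integer instead of in the C flag. The name add_n and the least-significant-chunk-first layout are assumptions for this example, not GMP's actual interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds two n-chunk numbers (least significant chunk
   first), stores the n-chunk sum in dst and returns the final carry (0 or 1). */
uint32_t add_n(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        uint32_t c1 = s < carry;   /* wrapped when adding the carry-in */
        s += b[i];
        uint32_t c2 = s < b[i];    /* wrapped when adding b[i] */
        dst[i] = s;
        carry = c1 | c2;           /* at most one of the two can be 1 */
    }
    return carry;
}
```

In asm the whole loop body collapses to loads, one adcs and a store, provided the loop counter and branch leave C untouched.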
IDK why you were jumping forwards over a bl (branch-and-link) which sets lr as a return address. Don't do that, structure your asm loops like a do{}while() because it's more efficient, especially when the trip-count is known to be non-zero so you don't have to worry about running the loop zero times in some cases.
There are cbz/cbnz instructions (docs) that jump on a register being zero or non-zero without affecting flags, but they can only jump forwards (out of the loop, past an unconditional branch). They're also only available in Thumb mode, unlike teq which was probably specifically designed to give ARM an efficient way to write BigInt loops.
BCD adding
Your algorithm has bugs; you need base-10 carry, like 0x05 + 0x06 = 0x11 not 0x0b in packed BCD.
And even the binary Carry flag isn't set by something like 0x0005000 + 0x0007000; there's no carry-out from the high bit, only into the next nibble. Also, adc adds the carry-in at the bottom of the register, not at the nibble your mask isolated.
So maybe you need to do something like subtract 0x000a000 from the sum (for that example shift position), because that will carry-out. (ARM sets C as a !borrow on subtraction, so maybe rsb reverse-subtract or swap the operands.)
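For reference, this is roughly what a digit-by-digit packed-BCD add has to compute, written as a C sketch. The name bcd_add is made up here, it assumes both inputs hold valid BCD digits (0-9 in every nibble), and it drops any carry out of the eighth digit:

```c
#include <stdint.h>

/* Hypothetical helper: adds two packed-BCD values of 8 digits each. */
uint32_t bcd_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    uint32_t carry = 0;
    for (int shift = 0; shift < 32; shift += 4) {
        /* Move each digit down to the bottom before adding the carry-in. */
        uint32_t digit = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry;
        carry = digit > 9;       /* base-10 carry, not the binary C flag */
        if (carry)
            digit -= 10;         /* keep the digit in the range 0-9 */
        result |= digit << shift;
    }
    return result;
}
```

The carry here is an ordinary integer decided by comparing against 9, which is why the binary carry flag from adcs alone can't do the job.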
NEON should make it possible to unpack to 8-bit elements (mask odd/even and interleave) and do all nibbles in parallel, but carry propagation is a problem; ARM doesn't have an efficient way to branch on SIMD vector conditions (unlike x86 pmovmskb). Just byte-shifting the vector and adding could generate further carries, as with 999999 + 1.
IDK if this can be cut down effectively with the same techniques hardware uses, like carry-select or carry-lookahead, but for 4-bit BCD digits with SIMD elements instead of single bits with hardware full-adders.
It's not worth doing for binary bigint because you can work in 32 or 64-bit chunks with the carry flag to help, but maybe there's something to gain when primitive hardware operations only do 4 bits at a time.

How to make this LC3 program multiply instead?

I was trying to learn how to multiply in LC3, but I'm having trouble modifying my old program, which was only meant for adding two numbers. How would I go about modifying this program to multiply the two given inputs instead?
Code:
.ORIG x3000 ; begin at x3000
; input two numbers
IN ;input an integer character (ascii) {TRAP 23}
LD R3, HEXN30 ;subtract x30 to get integer
ADD R0, R0, R3
ADD R1, R0, x0 ;move the first integer to register 1
IN ;input another integer {TRAP 23}
ADD R0, R0, R3 ;convert it to an integer
; add the numbers
ADD R2, R0, R1 ;add the two integers
; print the results
LEA R0, MESG ;load the address of the message string
PUTS ;"PUTS" outputs a string {TRAP 22}
ADD R0, R2, x0 ;move the sum to R0, to be output
LD R3, HEX30 ;add 30 to integer to get integer character
ADD R0, R0, R3
OUT ;display the sum {TRAP 21}
; stop
HALT ;{TRAP 25}
; data
MESG .STRINGZ "The sum of those two numbers is: "
HEXN30 .FILL xFFD0 ; -30 HEX
HEX30 .FILL x0030 ; 30 HEX
.END
The simplest approach to multiply on LC-3 is repetitive addition. So keep summing the multiplicand and decrement the multiplier; the iteration stops when the multiplier is consumed (i.e. zero).
There are lots of caveats: if the multiplier is negative, then we would either negate it to use with a count down, or count up instead; either way, the final result would be negated.
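The repeated-addition idea, including the sign handling, can be sketched in C like this. The function name is invented for the example, and it assumes the operands are small enough that the product fits in an int:

```c
/* Hypothetical helper: multiply using only addition and a countdown,
   which is all a machine without a multiply instruction needs. */
int mul_repeated_add(int multiplicand, int multiplier)
{
    int negate = multiplier < 0;
    int count = negate ? -multiplier : multiplier;  /* count down from |multiplier| */
    int product = 0;
    while (count != 0) {
        product += multiplicand;   /* keep summing the multiplicand */
        count--;                   /* one unit of the multiplier consumed */
    }
    return negate ? -product : product;
}
```

On LC-3 the loop body would be an ADD into the running product, an ADD of -1 to the counter, and a BRp back to the top.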
Since multiplication is commutative, we might consider using the lesser (absolute) value for the multiplier so that fewer iterations are done. But for more optimal multiplication, we would switch to a whole other algorithm, shift-and-add. Note that this algorithm is usually presented for hardware implementation, in which saving precious register bits is important, whereas for software this is not a really significant concern.
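A minimal C sketch of shift-and-add for unsigned 16-bit values follows; the name mul_shift_add is made up, and the result wraps modulo 65536 like a 16-bit register would:

```c
#include <stdint.h>

/* Hypothetical helper: shift-and-add multiplication, one multiplier bit
   per iteration, so at most 16 iterations regardless of the values. */
uint16_t mul_shift_add(uint16_t a, uint16_t b)
{
    uint16_t product = 0;
    while (b != 0) {
        if (b & 1)                           /* lowest multiplier bit set? */
            product = (uint16_t)(product + a);
        a = (uint16_t)(a << 1);              /* next power-of-two multiple of a */
        b >>= 1;                             /* move on to the next bit */
    }
    return product;
}
```

LC-3 has no shift instructions, so a port would double a with ADD R, R, R and, instead of shifting b right, test it against a one-bit mask that is doubled on each iteration.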

How do I compute the 16-bit sum of the 8-bit values of an array in assembly?

Feel like I've been asking a lot of these questions lately lol, but assembly is still pretty foreign to me.
Using an Arduino, I have to write a function in Atmel AVR Assembly for my computer science class that calculates the sum of the 8-bit values in an array and returns it as a 16-bit integer. The function is supposed to take in an array of bytes and a byte representing the length of the array as arguments, with those arguments stored in r24 and r22, respectively, when the function is called. I am allowed to use branching instructions and such.
The code is in this format:
.global sumArray
sumArray:
//magic happens
ret
I know how to make loops and increment the counter and things like that, but I am really lost as to how I would do this.
Does anyone know how to write this function in Atmel AVR Assembly? Any help would be much appreciated!
Why don't you ask the question to your compiler?
#include <stdint.h>

uint16_t sumArray(uint8_t *val, uint8_t count)
{
    uint16_t sum = 0;
    for (uint8_t i = 0; i < count; i++)
        sum += val[i];
    return sum;
}
Compiling with avr-gcc -std=c99 -mmcu=avr5 -Os -S sum8-16.c generates
the following assembly:
.global sumArray
sumArray:
mov r19, r24
movw r30, r24
ldi r24, 0
ldi r25, 0
.L2:
mov r18, r30
sub r18, r19
cp r18, r22
brsh .L5
ld r18, Z+
add r24, r18
adc r25,__zero_reg__
rjmp .L2
.L5:
ret
This may not be the most straightforward solution, but if you study
this code, you can understand how it works and, hopefully, come up with
your own version.
If you want something quick and dirty, add the two 8-bit values into an 8-bit register. If the sum is less than either input, it wrapped around, so make a second 8-bit register equal to 1, otherwise 0. That's how you can do the carry.
The processor should already have something called a carry flag that you can use to this end.
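Here is that quick-and-dirty carry detection as a C sketch that only ever does 8-bit arithmetic. The function name is invented for the example; lo and hi stand in for two 8-bit registers:

```c
#include <stdint.h>

/* Hypothetical helper: 16-bit sum of bytes built from 8-bit operations. */
uint16_t sum_array_8bit_steps(const uint8_t *val, uint8_t count)
{
    uint8_t lo = 0, hi = 0;
    for (uint8_t i = 0; i < count; i++) {
        uint8_t s = (uint8_t)(lo + val[i]);  /* 8-bit add, may wrap around */
        uint8_t carry = s < lo;              /* wrapped, so there was a carry */
        hi = (uint8_t)(hi + carry);          /* propagate it to the high byte */
        lo = s;
    }
    return (uint16_t)((hi << 8) | lo);
}
```

With the carry flag, the comparison disappears: on AVR an add on the low byte followed by adc with a zero register on the high byte does the same thing in two instructions.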
With pencil and paper, how do I add two two-digit decimal numbers when I was only taught to add two single-digit numbers at a time? 12 + 49? I can add 2 + 9 = 11, then what do I do? (Search for the word "carry".)

Does a CMP+JE consume more clock cycles than a single MUL?

I'm running an x86 processor, but I believe my question is pretty general. I'm curious about the theoretical difference in clock cycles consumed by a CMP + JE sequence versus a single MUL operation.
In C pseudocode:
unsigned foo = 1; /* must be 0 or 1 */
unsigned num = 0;
/* Method 1: CMP + JE*/
if(foo == 1){
num = 5;
}
/* Method 2: MUL */
num = foo*5; /* num = 0 if foo = 0 */
Don't look too deeply into the pseudocode, it's purely there to illuminate the mathematical logic behind the two methods.
What I'm actually comparing are the following two sequences of instructions:
Method 1: CMP + JE
MOV EAX, 1 ; FOO = 1 here, but can be set to 0
MOV EBX, 0 ; NUM = 0
CMP EAX, 1 ; if(foo == 1)
JE SUCCESS ; enter branch
JMP FINISH ; end program
SUCCESS:
MOV EBX, 5 ; num = 5
FINISH:
Method 2: MUL
MOV EAX, 1 ; FOO = 1 here, but can be set to 0
MOV ECX, EAX ; save copy of FOO to ECX
MUL ECX, 5 ; result = foo*5
MOV EBX, ECX ; num = result = foo*5
It seems that a single MUL (4 total instructions) is more efficient than a CMP + JE (6 total instructions), but do all instructions consume the same number of clock cycles, i.e. is the number of clock cycles it takes to complete an instruction the same for any other instruction?
If the actual clock cycles consumed is dependent on the machine, is a single MUL typically faster than the branching approach on most processors, since it requires fewer total instructions?
Modern CPU performance is much more complicated than just counting the number of cycles for each instruction. You need to take all of the following into account (at least):
Branch prediction
Instruction reordering
Register renaming
Instruction cache hits/misses
Data cache hits/misses
TLB misses/page faults
All of these will be heavily influenced by the surrounding code.
So essentially, it's almost impossible to perform a micro-benchmark like this and obtain a useful result!
However, if I had to guess, I'd say that the code without the JE will be more efficient in general, as it eliminates the branch, which simplifies the branch-prediction behaviour.
Typically, on a modern x86 processor, both the CMP and the MUL instruction will occupy an integer execution unit for one cycle (CMP is essentially a SUB that throws away the result and just modifies the flags register). However, modern x86 processors are also pipelined, superscalar and out-of-order, which means that the performance depends on more than just this underlying cycle cost alone.
If the branch cannot be predicted well, then the branch misprediction penalty will swamp other factors and the MUL version will perform significantly better.
On the other hand, if the branch can be well predicted and you immediately use num in a subsequent calculation, then it's possible for the branching version to perform better in the average case. That's because when it correctly predicts the branch, it can start speculatively executing the next instruction using the predicted value of num, before the result of the compare is available (whereas in the MUL case, subsequent use of num will have a data dependency on the result of the MUL, so it won't be able to execute until that result is available).
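If you want to experiment, the two variants can be written as separate C functions and their generated assembly compared. The function names are made up for this sketch, and foo is assumed to be 0 or 1 as in the question:

```c
/* Hypothetical helper: the branching form, a control dependency on foo. */
unsigned num_branch(unsigned foo)
{
    unsigned num = 0;
    if (foo == 1)
        num = 5;
    return num;
}

/* Hypothetical helper: the branchless form, a data dependency on foo. */
unsigned num_mul(unsigned foo)
{
    return foo * 5;   /* 0 when foo is 0, 5 when foo is 1 */
}
```

Keep in mind that an optimizing compiler is free to turn the branch into a conditional move or the multiply into something cheaper, so look at the actual output before drawing conclusions.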

Summing 2-dimensional array in EASy68K assembly

A 100x100 array A of integers, one byte each, is located at A. Write a program segment to compute the sum of the minor diagonal, i.e.
SUM = ΣA[i,99-i], where i=0...99
This is what I have so far:
LEA A, A0
CLR.B D0
CLR.B D1
ADDA.L #99, D0
ADD.B (A0), D1
ADD.B #1, D0
BEQ Done
ADDA.L #99,A0
BRA loop
There are quite a few issues in this code, including (but not limited to):
You use 'loop' and 'Done', but the labels are not shown in the code
You are adding 100 bytes into D1, also as a byte, so you are almost certainly going to overflow the result (the target of the sum should be at least 16 bits, so use .w or .l operations)
I'm perhaps wrong, but I think the 'minor diagonal' goes from the bottom left to the upper right, while your code goes from the top left to the bottom right of the array
On the performance side:
You should use the 'quick' variants of the 68000 instructions (such as moveq and addq)
Decrement and branch, as mentioned by JasonD, is more efficient than add/beq
Considering the code was close enough to a solution, here is a variant (I did not test it, hope it works)
lea A+99*100,a0 ; Points to the first column of the last row
moveq #0,d0 ; Start with Sum=0
moveq #100-1,d1 ; 100 iterations
Loop
moveq #0,d2 ; Clear register long
move.b (a0),d2 ; Read the byte
add.l d2,d0 ; Long add
lea -99(a0),a0 ; Move one row up and one column right
dbra d1,Loop ; Decrement d1 and branch to Loop until d1 gets negative
Done
; d0 now contains the sum
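The same walk can be checked against a C sketch. It assumes the array is stored row by row as 100*100 unsigned bytes, and the names DIM and minor_diagonal_sum are made up for this example:

```c
#include <stdint.h>

#define DIM 100

/* Hypothetical helper: sums A[i][DIM-1-i] over a flat row-major array. */
uint32_t minor_diagonal_sum(const uint8_t *a)
{
    const uint8_t *p = a + (DIM - 1) * DIM;  /* first column of the last row */
    uint32_t sum = 0;
    for (int i = 0; i < DIM; i++) {
        sum += *p;          /* byte zero-extended into a wide accumulator */
        p -= DIM - 1;       /* one row up and one column right */
    }
    return sum;
}
```

The stride of 99 bytes is the whole trick: moving up one row subtracts 100 and moving right one column adds 1.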
