I wrote a program in C which calculates the Ackermann values for 2 non-negative integers entered by the user. The program checks if the integers are non-negative and if they are it calculates the Ackermann value of them and then asks for new input or exit. The program works fine in C and I have no problem with it. Here is my code:
int ackermann(int m, int n){
if (m == 0) return n + 1;
if (n == 0) return ackermann(m - 1, 1);
return ackermann(m - 1, ackermann(m, n - 1));
}
However, for a university course we use a modified version of C (basically the same, but with some different syntax rules) which simulates the syntax and rules of MIPS assembly language. More specifically, we use registers to manipulate all data except arrays and structs. Also, we cannot use for, while, or do-while loops; we use if and goto statements instead. So I wrote the following program in this language (as I said, it is no more than C with different syntax). My problem is that it works only for (x,0) and (0,y) user inputs (x and y are non-negative numbers). It doesn't work for (4,1), (3,2), or in general any input where neither value is zero. I understand that it cannot work efficiently for very large numbers like (10,10) because of the enormous call stack these calculations need. But I want it to work for some simple inputs like Ackermann(3,1) == 13. For more on the Ackermann function, please see: http://en.wikipedia.org/wiki/Ackermann_function
Here is my code:
//Registers --- The basic difference from C is that we use registers to manipulate data
int R0=0,R1,R2,R3,R4,R5,R6,R7,R8,R9,R10,R11,R12,R13,R14,R15,R16,R17,R18,R19,R20,R21,
R22,R23,R24,R25,R26,R27,R28,R29,R30,R31;
int ackermann(int m, int n){
R4 = m;
R5 = n;
if(R4 != 0)
goto outer_else;
R6 = R5 + 1;
return R6;
outer_else:
if(R5 != 0)
goto inner_else;
R7 = R4 - 1;
R6 = ackermann(R7, 1);
return R6;
inner_else:
R8 = R5 - 1;
R9 = ackermann(R4, R8);
R10 = R4 - 1;
R6 = ackermann(R10, R9);
return R6;
}
I think your problem is that those register values are defined as global variables and they're being updated by an inner call to ackermann(), while an outer call depends on those values not changing. For example, take a look at the inner_else clause in your register version of ackermann(): it calls ackermann(R4, R8), and in the next statement depends on the current value of R4 but the recursive call alters the setting of R4 before it reaches the assignment statement.
Two common solutions:
1. Define your registers as local variables and let the compiler keep track of per-function-call state for you.
2. On entry to your ackermann() function, manually save the state of all registers, and then restore it on exit.
Although solution 1 is easier, I suspect your teacher might prefer solution 2, because it illustrates the kind of technique used by a compiler to deal with actual register management in its generated assembly code.
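Here is a rough sketch of what solution 2 could look like, assuming your course dialect lets you keep a plain int array as a save area. The names save_stack, sp and STACK_SIZE are made up for this illustration, and I only save R4 and R5 because those are the only registers an outer call still needs after an inner call returns.

```c
#include <assert.h>

/* Global "registers", as in the question (only the ones used here). */
int R4, R5, R6, R7, R8, R9, R10;

/* Hypothetical software stack for saved registers; the names and the
   size are assumptions made for this sketch. */
#define STACK_SIZE 4096
static int save_stack[STACK_SIZE];
static int sp = 0;

int ackermann(int m, int n) {
    /* Entry: save the caller's R4 and R5 before overwriting them. */
    assert(sp + 2 <= STACK_SIZE);
    save_stack[sp++] = R4;
    save_stack[sp++] = R5;
    R4 = m;
    R5 = n;
    if (R4 != 0)
        goto outer_else;
    R6 = R5 + 1;
    goto done;
outer_else:
    if (R5 != 0)
        goto inner_else;
    R7 = R4 - 1;
    R6 = ackermann(R7, 1);
    goto done;
inner_else:
    R8 = R5 - 1;
    R9 = ackermann(R4, R8);   /* the callee restores R4 before returning */
    R10 = R4 - 1;             /* so R4 still holds this call's m */
    R6 = ackermann(R10, R9);
done:
    /* Exit: restore the caller's registers in reverse order. */
    R5 = save_stack[--sp];
    R4 = save_stack[--sp];
    return R6;
}
```

With this change ackermann(3, 1) should give 13. The single exit point at done is what guarantees that every path restores the registers; the save area still limits how deep the recursion can go.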
I have just started to learn assembly language at school, and as an exercise I have to write a program that calculates the sum of the first n integers (1+2+3+4+5+...+n).
I managed to build this program, but in the comparison (line 9) I only compare the even numbers in register R1, so I would have to do another comparison for the odd numbers in R0.
MOV R0,#1 ; I put a register at 1 to start the sequence
INP R2,2 ; I ask the user up to what number the program should calculate, and I put its response in the R2 register
B myloop ; I define a loop
myloop:
ADD R1,R0,#1 ; I calculate n+1 and put it in register 1
ADD R3,R1,R0 ; I add R0 and R1, and I put the result in the register R3
ADD R0,R1,#1 ; I calculate n+2 and I put in the register R0, and so on...
ADD R4,R4,R3 ; R4 is the total result of all additions between R0 and R1
CMP R1,R2 ; I compare R1 and the maximum number to calculate
BNE myloop ; I only exit the loop if R1 and R2 are equal
STR R4,100 ; I store the final result in memory 100
OUT R4,4 ; I output the final result of the sequence
HALT ; I stop the execution of the program
I've tried several methods but I can't manage to perform this double comparison... (a bit like an "elif" in python)
Basically I would like to add this piece of code to also compare odd numbers:
CMP R0,R2
BNE myloop
But adding this directly after the comparison of the even numbers doesn't work, whether I put "BNE" or not.
You're trying to do a conjunction, in context, something like this:
do {
...
} while ( odd != n && even != n );
...
Eventually, one of those counters should reach the value n and stop the loop. So, both tests must pass in order to continue the loop. However, if either test fails, then the loop should stop.
First, we'll convert this loop into the if-goto-label form of assembly (while still using the C language!):
loop1:
...
if ( odd != n && even != n ) goto loop1;
...
Next, let's break down the conjunction to get rid of it. The intent of the conjunction is that if the first component fails, to stop the loop, without even checking the second component. However, if the first component succeeds, then go on to check the second component. And if the second also succeeds, then, and only then return to the top of the loop (knowing both have succeeded), and otherwise fall off the bottom. Either way, whether the first component fails or the second component fails, the loop stops.
This intent is fairly easy to accomplish in if-goto-label:
loop1:
...
if ( odd == n ) goto endLoop1;
if ( even != n ) goto loop1;
endLoop1:
...
Can you figure out how to follow this logic in assembly?
Another analysis might look like this:
loop1:
...
if ( odd == n || even == n ) goto endLoop1;
goto loop1;
endLoop1:
...
This is the same logic, but stated as how to exit the loop rather than how to continue the loop. The condition is inverted (De Morgan) but the intended target of the if-goto statement is also changed — it is effectively doubly-negated, so holds the same.
From that we would strive to remove the disjunction, again by making two statements instead of one with disjunction, also relatively straightforward using if-goto:
loop1:
...
if ( odd == n ) goto endLoop1;
if ( even == n ) goto endLoop1;
goto loop1;
endLoop1:
...
And with an optimization sometimes known as branch over unconditional branch (the unconditional branch is goto loop1;), we perform a pattern substitution, namely: (1) reversing the condition of the conditional branch, (2) changing the target of the conditional branch to the target of the unconditional branch, and (3) removing the unconditional branch.
loop1:
...
if ( odd == n ) goto endLoop1;
if ( even != n ) goto loop1;
endLoop1:
...
In summary, one takeaway is to understand how powerful the primitive if-goto is, and that it can be composed into conditionals of any complexity.
Another takeaway is that logic can be transformed by pattern matching and substitution into something logically equivalent but more desirable for some purpose like writing assembly! Here we work toward increasing use of simple if-goto's and lesser use of compound conditions.
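To make the equivalence concrete, here is a small sketch that runs the structured loop and the final if-goto form side by side. The counters and the function names are my own choices for illustration: odd walks 1, 3, 5, ..., even walks 2, 4, 6, ..., and each function returns how many passes the loop made before one counter reached n.

```c
#include <assert.h>

/* Structured form: keep looping while both tests pass. */
int iterations_compound(int n) {
    int odd = -1, even = 0, count = 0;
    assert(n >= 1);   /* otherwise neither counter ever reaches n */
    do {
        odd += 2;
        even += 2;
        count++;
    } while (odd != n && even != n);
    return count;
}

/* if-goto form, with the conjunction split into two simple tests. */
int iterations_if_goto(int n) {
    int odd = -1, even = 0, count = 0;
    assert(n >= 1);
loop1:
    odd += 2;
    even += 2;
    count++;
    if (odd == n) goto endLoop1;   /* first test failed: stop */
    if (even != n) goto loop1;     /* both tests passed: repeat */
endLoop1:
    return count;
}
```

Both functions should return the same count for any n of at least 1, for example 3 passes for n = 5 and for n = 6.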
Also, as @vorrade says, programs generally should not make assumptions about register values, so I suggest loading 0 into the registers that need it at the beginning of the program to ensure they are initialized. We generally don't clear registers after use but rather set them before use. Otherwise, in another larger program, your code might run with values left over in those registers by some other code.
Also, I answer the question posed, which is about compound conditionals, and explain how those work in some detail; though, as stated elsewhere, there's no need to separate even and odd numbers in order to sum them, and there's also a single formula that can compute the sum without iteration (though it does require multiplication, which may not be directly available and so would require a more bounded iteration).
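For reference, the formula meant here is the usual n(n+1)/2. A small sketch in C, with the plain loop written in if-goto form next to it for comparison (the function names are mine, and the bound on n assumes a 32-bit unsigned):

```c
#include <assert.h>

/* Iterative sum 1 + 2 + ... + n, written in if-goto form. */
unsigned sum_loop(unsigned n) {
    unsigned i = 0, sum = 0;
loop:
    if (i == n) goto done;
    i = i + 1;
    sum = sum + i;
    goto loop;
done:
    return sum;
}

/* Closed form n(n+1)/2: one multiplication instead of n additions. */
unsigned sum_formula(unsigned n) {
    assert(n <= 65535u);   /* keeps n * (n + 1) within 32 bits (assumed width) */
    return n * (n + 1u) / 2u;
}
```

Both give 55 for n = 10. On a machine without a multiply instruction, the product itself has to be built from additions or shifts, which is the "more bounded iteration" mentioned above.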
First of all your code assumes that R4 is 0 at the beginning.
This might not be true.
Your program becomes simpler and easier to understand if you add each number in a smaller loop, like this:
INP R2,2 ; I ask the user up to what number the program should calculate, and I put its response in the R2 register
MOV R0,#0 ; I put a register at 0 to start the sequence
MOV R4,#0 ; I put a register at 0 to start the sum
B testdone ; Jump straight to test if done
myloop:
ADD R0,R0,#1 ; I calculate n+1 and keep it in register 0
ADD R4,R4,R0 ; I add R4 and R0, and I put the result in the register R4
testdone:
CMP R0,R2 ; I compare R0 and the maximum number to calculate
BNE myloop ; I only exit the loop if R0 and R2 are equal
STR R4,100 ; I store the final result in memory 100
OUT R4,4 ; I output the final result of the sequence
HALT ; I stop the execution of the program
You only need 3 registers: R0 for current n, R2 for limit and R4 for the sum.
However, if you really have to add the even and odd numbers separately, you could do it this way:
INP R2,2 ; I ask the user up to what number the program should calculate, and I put its response in the R2 register
MOV R0,#0 ; I put a register at 0 to start the sequence
MOV R4,#0 ; I put a register at 0 to start the sum
B testdone ; Jump straight to test if done
myloop:
ADD R0,R0,#1 ; I calculate n+1 (odd numbers), still register 0
ADD R4,R4,R0 ; I add R4 and R0, keep the result in the register R4
CMP R0,R2 ; I compare R0 and the maximum number to calculate
BEQ done ; I only exit the loop if R0 and R2 are equal
ADD R0,R0,#1 ; I calculate n+1 (even numbers) and keep it in register 0
ADD R4,R4,R0 ; I add R4 and R0, and I put the result in the register R4
testdone:
CMP R0,R2 ; I compare R0 and the maximum number to calculate
BNE myloop ; I only exit the loop if R0 and R2 are equal
done:
STR R4,100 ; I store the final result in memory 100
OUT R4,4 ; I output the final result of the sequence
HALT ; I stop the execution of the program
First of all, I would like to thank you very much, @vorrade, @Erik Eidt and @Peter Cordes. I read your comments and advice very carefully, and they are very useful to me :)
But in fact following the post of my question I continued to seek by myself a solution to my problem, and I came to develop this code which works perfectly!
// Date : 31/01/2022 //
// Description : A program that calculate the sum of the first n integers (1+2+3+4+5+...+n) //
MOV R0,#1 // I put a register at 1 to start the sequence of odd number
INP R2,2 // I ask the user up to what number the program should calculate, and I put its response in the R2 register
B myloop // I define a main loop
myloop:
ADD R1,R0,#1 // I calculate n+1 (even) and I put in the register R1, and so on...
ADD R3,R1,R0 // I add R0 and R1, and I put the result in the register R3
ADD R4,R4,R3 // R4 is the total result of all additions between R0 and R1, which is the register that temporarily stores the calculated results in order to increment them later in register R4
ADD R0,R1,#1 // I calculate the next odd number to add to the sequence
B test1 // The program goes to the first comparison loop
test1:
CMP R0,R2 // I compare the odd number which is in the current addition with the requested maximum, this comparison can only be true if the maximum is also odd.
BNE test2 // If the comparison is not equal, then I move on to the next test which does exactly the same thing but for even numbers this time.
ADD R4,R4,R2 // If the comparison is equal, then I add a step to the final result because my main loop does the additions 2 by 2.
B final // The program goes to the final loop
test2:
CMP R1,R2 // I compare the even number which is in the current addition with the requested maximum, this comparison can only be true if the maximum is also even.
BNE myloop // If the comparison is not equal, then the program returns to the main loop because this means that all the comparisons (even or odd) have concluded that the maximum has not yet been reached and that it is necessary to continue adding.
B final // The program goes to the final loop
final:
STR R4,100 // I store the final result in memory 100
OUT R4,4 // I output the final result of the sequence
HALT // I stop the execution of the program
I made some comments to explain my process!
I now realize that the decomposition into several loops was indeed the right solution to my problem, it allowed me to better realize the different steps that I had first written down on paper.
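For anyone who wants to trace the control flow without a simulator, here is my rough transliteration of the program above into C with if and goto. The function name sum_by_pairs and the lower-case variables are only for this sketch. Two things are made explicit that the assembly leaves implicit: r4 is set to 0 at the start, and n is assumed to be at least 2, because for smaller values neither comparison ever matches.

```c
#include <assert.h>

/* C model of the assembly above; r0..r4 mirror R0..R4. */
int sum_by_pairs(int n) {
    int r0 = 1, r1, r3, r4 = 0;   /* r4 is cleared explicitly here */
    assert(n >= 2);               /* smaller n never matches either test */
myloop:
    r1 = r0 + 1;                  /* next even number */
    r3 = r1 + r0;                 /* odd + even */
    r4 = r4 + r3;                 /* running total */
    r0 = r1 + 1;                  /* next odd number */
    if (r0 != n) goto test2;      /* test1: is the maximum odd? */
    r4 = r4 + n;                  /* add the final odd number on its own */
    goto final;
test2:
    if (r1 != n) goto myloop;     /* test2: is the maximum even? */
final:
    return r4;
}
```

For example, n = 4 goes through the loop twice and leaves through test2 with 10, while n = 5 leaves through test1 with 10 + 5 = 15.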
Platform: Cortex-M3
IDE: Keil uVision5.10
Hello everyone ~
Here is a simple example:
A function that doesn't return a value (let's call it function1) in C:
void add_one(int n)
{
int a = n+1;
}
Its assembly code is:
ADDS r1,r0,#1
BX lr
A function that returns a value (let's call it function2) in C:
int add_one(int n)
{
int a = n+1;
return a;
}
and its assembly code is:
MOV r1,r0
ADDS r0,r1,#1
BX lr
As far as I can see, the only difference is that function2 first moves parameter n out of r0 and then executes the computation, while function1 computes directly.
My confusion is that both functions end with
BX lr
I know that the effect of this instruction is to make the program jump to the address contained in register lr. How can function2 return a value? What exactly happens?
The return value is stored in R0. BX LR will jump back to the caller, which knows the function returns something and can now get it from the R0 register. This agreement between caller and callee is named "calling convention".
You should check out the Procedure Call Standard for the ARM Architecture. For example, under 5.4 Result Return, it says:
A word-sized Fundamental Data Type (e.g., int, float) is returned in r0.
Which is exactly what you're seeing.
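Seen from the C side, the agreement looks like this. The caller add_two is a made-up example, and the register comments describe what the procedure call standard prescribes rather than the output of any particular compiler run:

```c
/* Callee: n arrives in r0, and the int result is left in r0
   when BX lr returns to the caller. */
int add_one(int n) {
    int a = n + 1;
    return a;
}

/* Hypothetical caller: after each call returns, the generated code
   takes the result from r0 and uses it like any other value. */
int add_two(int n) {
    int once = add_one(n);    /* result of the first call, read from r0 */
    return add_one(once);     /* passed back in r0 as the next argument */
}
```

In the void version of add_one, the sum ends up in r1 and the caller never looks at it, which is why that function can skip the move into r0.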
I'm trying to create a simple loop in assembly to perform an instruction until a certain condition is met. For example, I want to implement this C code in assembly:
int compute_sum(int n)
{
int i = 2;
int sum = 0;
while(i <= n)
{
sum = sum + i;
i = i + 4;
}
return sum;
}
The outline I made for myself is this:
/ ADD (compute sum)
/ Increment to keep track of # times passed through loop
/ SNA (skip if difference between user input and number is < 0)
/ BUN xxx (repeat)
I read in user input and have the decimal representation, but do not know the address that should follow BUN so that the instructions are repeated. These are all done in simple computer instructions
You might want to practice getting into the correct mindset by using C without structured conditions (ie using labels and gotos):
i = 2;
sum = 0;
loop:
if (i > n) goto finished;
sum = sum + i;
i = i + 4;
goto loop;
finished:
This is perfectly valid C (albeit archaic) but shows what you need to do at the simplest level. Compare i with n and branch to finished if greater; otherwise do the work and branch unconditionally back to the loop label.
If the assembler language you are using does not have an unconditional branch then you can set the flag and branch (eg SEC, BCS loop) or count on the idea that i will not overflow and when you add 4, branch on no overflow - just make sure it doesn't fail catastrophically if this is not the case.
So, in assembler (which shares the label syntax), you would have:
loop:
cmp i, n ; Or register equivalents
bgt finished
....
add i, 4
bvc loop
finished:
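If you want to convince yourself that the label-and-goto form really does the same thing as the while loop, you can wrap both in functions and compare them. This is only a sketch: the two function names are mine, and n is assumed to be small enough that i + 4 does not overflow.

```c
/* Structured version, as in the question. */
int compute_sum_structured(int n) {
    int i = 2;
    int sum = 0;
    while (i <= n) {
        sum = sum + i;
        i = i + 4;
    }
    return sum;
}

/* Same logic with labels and gotos, one step away from assembly. */
int compute_sum_goto(int n) {
    int i = 2;
    int sum = 0;
loop:
    if (i > n) goto finished;   /* conditional branch out of the loop */
    sum = sum + i;
    i = i + 4;
    goto loop;                  /* unconditional branch back to the top */
finished:
    return sum;
}
```

For n = 10 both add 2 + 6 + 10 and return 18. The address that follows your BUN instruction plays the role of the loop label here.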
I'm using the algorithm below (the idea is taken from this answer) to generate code from a tree. I'm targeting the x86 architecture, and now I need to deal with the mul/div instructions, which use the registers eax/ebx as arguments.
My question is:
How do I modify this so that the operands of a certain instruction are loaded into fixed registers? Say, for the mul instruction, load the left and right subtrees into the eax and ebx registers. My current implementation is: pass the current node being evaluated as an argument and, if it's MUL or DIV, set reg to R0 or R1 according to the tree's side (LEFT or RIGHT respectively). If reg is in_use, push reg on the stack and mark it as being free (not implemented yet). The current implementation doesn't work because it fails the assert(r1 != r2) in the emit_load() function (meaning both registers passed as arguments are equal, like r1 = REG_R0 and r2 = REG_R0).
void gen(AST *ast, RegSet in_use, AST *root) {
if(ast->left != 0 && ast->right != 0) {
Reg spill = NoRegister; /* no spill yet */
AST *do1st, *do2nd; /* what order to generate children */
if (ast->left->n >= ast->right->n) {
do1st = ast->left;
do2nd = ast->right;
} else {
do1st = ast->right;
do2nd = ast->left; }
gen(do1st, in_use);
in_use |= 1 << do1st->reg;
if (all_used(in_use)) {
spill = pick_register_other_than(do1st->reg);
in_use &= ~(1 << spill);
emit_operation(PUSH, spill);
}
gen(do2nd, in_use);
ast->reg = ast->left->reg;
emit_operation(ast->type, ast->left->reg, ast->right->reg);
if (spill != NoRegister)
emit_operation(POP, spill);
} else if(ast->type == Type_id || ast->type == Type_number) {
if(node->type == MUL || node->type == DIV) {
REG reg;
if(node_side == ASTSIDE_LEFT) reg = REG_R0;
if(node_side == ASTSIDE_RIGHT) reg = REG_R1;
if(is_reg_in_use(in_use, reg)) {
emit_operation(PUSH, reg);
}
} else {
ast->reg = pick_unused_register(in_use);
emit_load(ast);
}
} else {
print("gen() error");
// error
}
}
// Ershov numbers
void label(AST *ast) {
if(ast == NULL)
return;
label(ast->left);
label(ast->right);
if(ast->type == Type_id || ast->type == Type_number)
ast->n = 1;
// ast has two children
else if(ast->left != NULL && ast->right != NULL) {
int l = ast->left->n;
int r = ast->right->n;
if(l == r)
ast->n = 1 + l;
else
ast->n = (l > r) ? l : r; // larger of the two labels; both are at least 1
}
// ast has one child
else if(ast->left != NULL && ast->right == NULL)
ast->n = ast->left->n;
else
print("label() error!");
}
With a one-pass code generator like this, your options are limited. It's probably simpler to generate 3-address code or some other linear intermediate representation first and then worry about register targeting (this is the name for what you're trying to accomplish).
Nonetheless, what you want to do is possible. The caveat is that you won't get very high quality code. To make it better, you'll have to throw away this generator and start over.
The main problem you're experiencing is that Sethi-Ullman labeling is not a code generation algorithm. It's just a way of choosing the order of code generation. You're still missing important ideas.
With all that out of the way, some points:
Pushing and popping registers to save them temporarily makes life difficult. The reason is pretty obvious. You can only get access to the saved values in LIFO order.
Things become easier if you allocate "places" that may be either registers or memory locations in the stack frame. The memory locations effectively extend the register file to make it as large as necessary. A slight complication is that you'll need to remember for each function how many words are required for places in that function's stack frame and backpatch the function preamble to allocate that number.
Next, implement a global operand stack where each stack element is a PLACE. A PLACE is a descriptor telling where an operand that has been computed by already-emitted code is located: register or memory and how to access it. (For better code, you can also allow a PLACE to be a user variable and/or immediate value, but such PLACEs are never returned by the PLACE allocator described below. Additionally, the more kinds of PLACEs you allow, the more cases must be handled by the code emitter, also described below.)
The general principle is "be lazy." The later we can wait to emit code, the more information will be available. With more information, it's possible to generate better code. The stack of PLACEs does a reasonably good job of accomplishing this.
The code generator invariant is that it emits code that leaves the result PLACE at the top of the operand stack.
You will also need a PLACE allocator. This keeps track of registers and the memory words in use. It allocates new memory words if all registers and current words are already busy.
Registers in the PLACE allocator can have three possible statuses: FREE, BUSY, PINNED. A PINNED register is one needed to hold a value that can't be moved. (We'll use this for instructions with specific register requirements.) A BUSY register is one needed for a value that's okay to be moved to a different PLACE as required. A FREE register holds no value.
Memory in the PLACE allocator is either FREE or BUSY.
The PLACE allocator needs at least these entry points:
allocate_register: pick a FREE register R, make it BUSY, and return R. If no FREE registers are available, allocate a FREE memory word P, move a BUSY register R's contents there, and return R.
pin_register(R): if R is PINNED, raise a fatal error. If R is BUSY, get a FREE PLACE P (either register or memory word), emit code that moves the contents of R to P, mark R PINNED and return. If R is FREE, just mark it PINNED and return.
Note that when pinning or allocating register R requires moving its contents, the allocator must update the corresponding element in the operand stack. What was R must be changed to P. For this purpose, the allocator maintains a map taking each register to the operand stack PLACE that describes it.
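To make the two entry points a bit more tangible, here is a minimal C sketch of such an allocator. Everything in it is an assumption made for illustration: the table sizes, the Place layout, the helper names (allocate_slot, evict, bind_register), and the way "emitted code" is just printed. A real allocator would be tied to your operand stack and instruction emitter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { NUM_REGS = 2, NUM_SLOTS = 8 };          /* hypothetical sizes */
typedef enum { FREE, BUSY, PINNED } State;
typedef enum { REGISTER, MEMORY } Kind;
typedef struct { Kind kind; int index; } Place;

static State reg_state[NUM_REGS];              /* zero-initialized: all FREE */
static State slot_state[NUM_SLOTS];
static Place *reg_owner[NUM_REGS];             /* register -> its operand stack entry */

/* Record which operand stack entry currently describes register r. */
void bind_register(int r, Place *owner) { reg_owner[r] = owner; }

/* Find a FREE frame slot and mark it BUSY. */
static int allocate_slot(void) {
    for (int s = 0; s < NUM_SLOTS; s++)
        if (slot_state[s] == FREE) { slot_state[s] = BUSY; return s; }
    assert(0 && "out of frame slots");
    return -1;
}

/* Emit a move of register r's value to `to`, then retarget its owner. */
static void evict(int r, Place to) {
    printf("mov %s%d, r%d\n", to.kind == REGISTER ? "r" : "slot", to.index, r);
    if (reg_owner[r] != NULL)
        *reg_owner[r] = to;
    if (to.kind == REGISTER)
        reg_owner[to.index] = reg_owner[r];
    reg_owner[r] = NULL;
}

int allocate_register(void) {
    for (int r = 0; r < NUM_REGS; r++)
        if (reg_state[r] == FREE) { reg_state[r] = BUSY; return r; }
    for (int r = 0; r < NUM_REGS; r++)
        if (reg_state[r] == BUSY) {            /* spill a movable value */
            Place p = { MEMORY, allocate_slot() };
            evict(r, p);
            return r;                          /* stays BUSY for the new value */
        }
    assert(0 && "all registers are pinned");
    return -1;
}

void pin_register(int r) {
    assert(reg_state[r] != PINNED);            /* pinning twice is a fatal error */
    if (reg_state[r] == BUSY) {
        Place p = { MEMORY, -1 };
        for (int i = 0; i < NUM_REGS; i++)
            if (reg_state[i] == FREE) {        /* prefer another register */
                reg_state[i] = BUSY;
                p.kind = REGISTER;
                p.index = i;
                break;
            }
        if (p.kind == MEMORY)
            p.index = allocate_slot();         /* otherwise use the frame */
        evict(r, p);
    }
    reg_state[r] = PINNED;
}
```

The reg_owner table is the map mentioned in the note above: whenever a value is moved out of a register, the Place that described it is rewritten in place, so the operand stack keeps pointing at the right location.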
With all this complete, the code generator for binary ops will be simple:
gen_code_for(ast_node) {
if (ast_node->left_first) {
gen_code_for(ast_node->left_operand)
gen_code_for(ast_node->right_operand)
} else {
gen_code_for(ast_node->right_operand)
gen_code_for(ast_node->left_operand)
swap_stack_top_2() // get stack top 2 elements in correct order
}
emit_code_for(ast_node)
}
The code emitter will work like this:
emit_code_for(ast_node) {
switch (ast_node->kind) {
case DIV: // An operation that needs specific registers
pin_register(EAX) // Might generate some code to make EAX available
pin_register(EDX) // Might generate some code to make EDX available
emit_instruction(XOR, EDX, EDX) // clear EDX
emit_instruction(MOV, EAX, stack(1)) // lhs to EAX
emit_instruction(DIV, stack(0)) // divide by rhs operand
pop(2) // remove 2 elements and free their PLACES
free_place(EDX) // EDX not needed any more.
mark_busy(EAX) // EAX now only busy, not pinned.
push(EAX) // Push result on operand stack
break;
case ADD: // An operation that needs no specific register.
PLACE result = emit_instruction(ADD, stack(1), stack(0))
pop(2)
push(result)
break;
... and so on
}
}
Finally, the instruction emitter must know what to do when its operands have combinations of types not supported by the processor instruction set. For example, it might have to load a memory PLACE into a register.
emit_instruction(op, lhs, [optional] rhs) {
switch (op) {
case DIV:
assert(EAX->state == PINNED && EDX->state == PINNED)
print_instruction(DIV, lhs)
return EAX;
case ADD:
if (lhs->kind == REGISTER) {
print_instruction(ADD, lhs, rhs)
return lhs
}
if (rhs->kind == REGISTER) {
print_instruction(ADD, rhs, lhs)
return rhs
}
// Both operands are MEMORY
R = allocate_register // Get a register; might emit some code.
print_instruction(MOV, R, lhs)
print_instruction(ADD, R, rhs)
return R
... and so on ...
}
}
I've necessarily left out many details. Ask what isn't clear.
OP's Questions Addressed
You're right that I intended stack(n) to be the PLACE that is n from the top of the operand stack.
Leaves of the syntax tree just push a PLACE for a computed value on the operand stack to satisfy the invariant.
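In case it helps, the operand stack itself can be as small as the following sketch. The fixed depth, the Place layout and the bounds checks are my assumptions. This pop only drops the entries, whereas in the scheme above it would also hand their PLACEs back to the allocator.

```c
#include <assert.h>

typedef enum { REGISTER, MEMORY } Kind;
typedef struct { Kind kind; int index; } Place;

enum { MAX_DEPTH = 64 };                 /* hypothetical limit */
static Place operands[MAX_DEPTH];
static int depth = 0;

/* Push the PLACE of a freshly computed value. */
void push(Place p) {
    assert(depth < MAX_DEPTH);
    operands[depth++] = p;
}

/* stack(n) is the PLACE that is n entries below the top; stack(0) is the top. */
Place *stack(int n) {
    assert(n >= 0 && n < depth);
    return &operands[depth - 1 - n];
}

/* Drop the top `count` entries (freeing their PLACEs is omitted here). */
void pop(int count) {
    assert(count >= 0 && count <= depth);
    depth -= count;
}

/* Exchange the two topmost entries, used when the right operand was generated first. */
void swap_stack_top_2(void) {
    Place *top = stack(0);
    Place *below = stack(1);
    Place tmp = *top;
    *top = *below;
    *below = tmp;
}
```

A leaf node would then do nothing more than build a Place for its value and call push.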
As I said above, you can either create special kinds of PLACEs for these operands (user-labeled memory locations and/or immediate values), or - simpler and as you proposed - allocate a register and emit code that loads the value into that register, then push the register's PLACE onto the stack. The simpler method will result in unnecessary load instructions and consume more registers than needed. For example x = x + 1 will generate code something like:
mov esi, [ebp + x]
mov edi, 1
add esi, edi
mov [ebp + x], esi
Here I'm using x to denote the base pointer offset of the variable.
With PLACEs for variables and literals, you can easily get:
mov esi, [ebp + x]
add esi, 1
mov [ebp + x], esi
By making the code generator aware of the PLACE where the assignment needs to put its answer, you can get
add [ebp + x], 1
or equivalently
inc [ebp + x]
Accomplish this by adding a parameter PLACE *target to the code generator that describes where the final value of the computed expression value needs to go. If you're not currently compiling an expression, this is set to NULL. Note that with target, the code generator invariant changes: The expression result's PLACE is at the top of the operand stack unless target is set. In that case, it's already been computed into the target's PLACE.
How would this work on x = x + 1? The ASSIGNMENT case in the emit_code_for procedure would provide the target as the PLACE for x when it calls itself recursively to compile x + 1. This delegates responsibility downward for getting the computed value to the right memory location, which is x. The emit_code_for case for ADD now calls emit_code_for recursively to evaluate the operands x and 1 onto the stack. Since we have PLACEs for user variables and immediate values, these are pushed on the stack while generating no code at all. The ADD emitter must now be smart enough to know that if it sees a memory location L and a literal constant C on the stack and the target is also L, then it can emit add L, C, and it's done.
Remember that every time you make the code generator "smarter" by providing it with more information to make its decisions like this, it will become longer and more complicated because there are more cases to handle.
I have a 2D matrix containing 0, 1 and 2. I am writing a CUDA kernel where the number of threads is equal to the matrix size and each thread operates on one element of the matrix. Now, I need a mathematical operation that keeps 0 and 1 as they are but converts 2 to 1. That is, a mathematical operation, without any if-else, which does the following conversion: 0 -> 0; 1 -> 1; 2 -> 1. Is there any way to do this conversion using mathematical operators? Any help would be extremely appreciated. Thank you.
This is not a cuda question.
int A;
// set A to 0, 1, or 2
int a = (A + (A>>1)) & 1;
// a is now 0 if A is 0, or 1 if A is 1 or 2
or as a macro:
#define fix01(x) ((x+(x>>1))&1)
int a = fix01(A);
This also seems to work:
#define fix01(x) ((x&&1)&1)
I don't know if the use of the boolean AND operator (&&) fits your definition of "mathematical operations".
As the question was about "mathematical" functions I suggest the following 2nd order polynomial:
int f(int x) { return ((3-x)*x)/2; }
But if you want avoid branching in order to maximize speed: There is a min instruction since PTX ISA 1.0. (See Tab. 36 in the PTX ISA 3.1 manual.) So the following CUDA code
__global__ void test(int *x, int *y)
{
*y = *x <= 1 ? *x : 1;
}
compiles to the following assembler for sm_10 in my test (just called nvcc from CUDA 5 without any arch options)
code for sm_10
Function : _Z4testPiS_
/*0000*/ /*0x1000c8010423c780*/ MOV R0, g [0x4];
/*0008*/ /*0xd00e000580c00780*/ GLD.U32 R1, global14 [R0];
/*0010*/ /*0x1000cc010423c780*/ MOV R0, g [0x6];
/*0018*/ /*0x30800205ac400780*/ IMIN.S32 R1, R1, c [0x1] [0x0];
/*0020*/ /*0xd00e0005a0c00781*/ GST.U32 global14 [R0], R1;
So a min() implementation using a conditional ?: actually compiles to a single IMIN.S32 instruction without any branching. So I'd recommend this for any real-world applications:
int f(int x) { return x <= 1 ? x : 1; }
But back to the question of using only non-branching operations:
Another form of getting this result in C is by using two not operators:
int f(int x) { return !!x; }
Or simply compare with zero:
int f(int x) { return x != 0; }
(The results of ! and != are guaranteed to be 0 or 1, compare Sec. 6.5.3.3 Par. 5 and Sec. 6.5.9 Par. 3 of the C99 standard, ISO/IEC 9899:1999. As far as I recall, this guarantee also holds in CUDA.)
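As a quick sanity check, the variants from this thread can be compared on the three possible inputs with plain host-side C. The function names are mine and this is only a sketch; in a kernel you would mark them __device__.

```c
/* The variants suggested in this thread, under hypothetical names. */
int fix_shift(int x)   { return (x + (x >> 1)) & 1; }
int fix_poly(int x)    { return ((3 - x) * x) / 2; }
int fix_min(int x)     { return x <= 1 ? x : 1; }
int fix_not_not(int x) { return !!x; }
int fix_nonzero(int x) { return x != 0; }

/* Returns 1 when every variant maps 0 -> 0, 1 -> 1 and 2 -> 1. */
int variants_agree(void) {
    static const int expected[3] = { 0, 1, 1 };
    for (int x = 0; x <= 2; x++) {
        if (fix_shift(x)   != expected[x] ||
            fix_poly(x)    != expected[x] ||
            fix_min(x)     != expected[x] ||
            fix_not_not(x) != expected[x] ||
            fix_nonzero(x) != expected[x])
            return 0;
    }
    return 1;
}
```

Note that the variants are only interchangeable on the inputs 0, 1 and 2. For x = 3, for instance, the shift and polynomial versions give 0 while the min and comparison versions give 1.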