So I'm compiling a subset of C to a simple stack VM for learning purposes and I'd like to know how to best compile a switch statement, e.g.
switch (e) {
case 0: { ... }
case 1: { ... }
...
case k: { ... }
}
The book I'm going through offers a simple way to compile it with indexed jumps, but that scheme only works for contiguous, ascending ranges of case values.
Right now I'm using symbolic labels in the first pass and resolving them to actual jump targets in a second pass, because labels simplify the initial compilation to stack instructions quite a bit. My idea is to generalize the book's scheme to any sequence of case values in any order, as follows. Call the case values c1, c2, ..., cn and the corresponding labels j1, j2, ..., jn, then generate the following sequence of instructions, assuming the value of e is on top of the stack:
dup, loadc c1, eq, jumpnz j1,
dup, loadc c2, eq, jumpnz j2,
...
dup, loadc cn, eq, jumpnz jn,
pop, jump default
j1: ...
j2: ...
...
jn: ...
default: ...
The notation is hopefully clear but if not: dup = duplicate value on top of stack, loadc c = push a constant c on top of the stack, eq = compare top two values on the stack and push 0 or 1 based on the comparison, jumpnz j = jump to the label j if the top stack value is not 0, label: = something that will be resolved to an actual address during second compilation pass.
So my question is then what are some other ways of compiling switch statements? My way of doing it is much less compact than an indexed jump table for contiguous ranges of case values but works out pretty well when there are large gaps.
First sort all the cases. Then identify all contiguous or near-contiguous sequences that are large enough to be worthwhile, and treat each one as a single unit that is handled with a jump table. Then, instead of your linear sequence of compares and jumps, use a balanced binary tree of branches to minimize the average number of jumps, so that dispatch takes O(log n) comparisons instead of O(n). You do this by comparing against the median of the cases or blocks of cases, recursively.
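A minimal sketch of the binary-tree part, written in Python as a stand-in for whatever language the compiler is written in. It assumes a hypothetical `lt` instruction (pops b, then a, pushes 1 if a < b) next to the `dup`, `loadc`, `eq`, `jumpnz`, `pop` and `jump` instructions from the question, and it leaves out the jump-table clusters for brevity. The names `compile_switch` and `run` are made up for illustration; `run` is only a toy interpreter that reports which external label the code ends up jumping to.

```python
from itertools import count

def compile_switch(cases, default="default"):
    """Emit instructions for a switch; `cases` maps case constant -> label.

    As in the question, the switch value is on top of the stack on entry and
    is still there when a case label is reached; it is popped before `default`.
    """
    fresh = (f"_t{i}" for i in count())   # internal labels for the tree
    code = []

    def gen(items):
        if len(items) <= 2:
            # Small leaf: fall back to the linear compare-and-jump chain.
            for value, target in items:
                code.extend([("dup",), ("loadc", value), ("eq",), ("jumpnz", target)])
            code.extend([("pop",), ("jump", default)])
            return
        mid = len(items) // 2
        left = next(fresh)
        # Values below the median go to the left subtree (needs the assumed `lt`).
        code.extend([("dup",), ("loadc", items[mid][0]), ("lt",), ("jumpnz", left)])
        gen(items[mid:])
        code.append(("label", left))
        gen(items[:mid])

    gen(sorted(cases.items()))
    return code

def run(code, value):
    """Toy interpreter: return the first label jumped to that is not in `code`."""
    where = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "label"}
    stack, pc = [value], 0
    while True:
        op, *arg = code[pc]
        pc += 1
        if op == "dup":
            stack.append(stack[-1])
        elif op == "loadc":
            stack.append(arg[0])
        elif op in ("eq", "lt"):
            b, a = stack.pop(), stack.pop()
            stack.append(int(a == b if op == "eq" else a < b))
        elif op == "pop":
            stack.pop()
        elif op == "jump" or (op == "jumpnz" and stack.pop() != 0):
            if arg[0] not in where:
                return arg[0]
            pc = where[arg[0]]

code = compile_switch({7: "j7", -3: "jm3", 100: "j100", 42: "j42", 0: "j0"})
print(run(code, 42))
```

A dense run of case values could be turned into a single leaf that uses the book's indexed jump instead of the linear chain.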
I'm a beginner in assembly language. Could you guide me through the steps to complete this assignment?
The equation is: result = sum, for i = 0 to N-1, of ((-3 + a(i)) + (b(i) - 14))
here is the picture
The task in the main section is to explicitly follow the equation: iteratively add -3 to the indexed value in array a, subtract 14 from the indexed value in array b, and finally add these two parts and store the resulting value in memory location result.
Use a maximum of three general purpose registers in this lab. You are NOT allowed to change any values in the memory locations of a and b. Some of the opcodes of use in this lab are:
• mov - moving data from register-register, register-variable etc
• lea - loading effective address of a variable to a register.
• add - adding two values in registers or in variables.
• sub - subtract two values in registers or in variables
Upon completion of the task, zero out all the used registers and return.
I was given the following data segment and skeleton:
segment .data
a dw -4, 22, 144 ; array of 3 values
b db -3, -16, 12 ; array of 3 values
result dq 0 ; memory to result
segment .text
global main
main:
Could you guide me through the steps to complete it?
First, I would write out a proper C version of that, so you can see the detail that has to be written out in programming languages, but hidden by the Sigma notation. It will only be about 3 lines of code in total, and that will add clarity.
Next, fundamentally, the approach is to decompose the problem (i.e. that C code), into smaller piece parts, then solve each piece part, and recompose that into a total solution.
It is good to start by understanding the data; here that means the variables i, N, result, a, b and possibly a temporary or two if you need to compute an intermediate value. That's more than 3 items, so they cannot all go in registers. However, global variables can be accessed without requiring registers (a, b, and result are global variables).
Next write out the code and solve each statement, and solve each expression in each statement. Once you know where your variables are you can write instructions that work with them.
The idea is to first map the variables of the algorithm into physical storage of the processor: either CPU registers or memory. Once you have that mapping, you can start to write assembly instructions, but the other way around is awkward (though you may have to iterate if you run out of registers mid way, for example).
For example, the basic for-loop can be translated to a while loop, then to an if-goto-label loop. The three forms follow in that order:
for ( int i = 0; i < N; i++ )
    <loop-body>

int i = 0;
while ( i < N ) {
    <loop-body>
    i++;
}

int i = 0;
loop1:
    if ( i >= N ) goto endLoop1;
    <loop-body>
    i++;
    goto loop1;
endLoop1:
The above if-goto-label form is pretty close to assembly, and has the control structure of the original for-loop.
Next, break down the loop-body into its individual piece parts, which will include array referencing, addition, subtraction, and summation. So, figure out each of those and then place them together in context of the formula you're working.
Compose all of that into a solution and you'll have your program. If you find you run out of registers, given a working limit of 3, you can take a step back and figure out something to put in memory instead.
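As a high-level reference to check an assembly solution against, the formula can be written out with the loop already in the if-goto shape. This is only a sketch, in Python rather than C, and the names `sigma`, `term_a` and `term_b` are made up for illustration; the two temporaries correspond to the intermediate values that would live in registers.

```python
def sigma(a, b):
    n = len(a)              # N in the formula
    result = 0
    i = 0
    while True:
        if i >= n:          # if ( i >= N ) goto endLoop1;
            break
        term_a = -3 + a[i]  # add -3 to the indexed value of a
        term_b = b[i] - 14  # subtract 14 from the indexed value of b
        result += term_a + term_b
        i += 1
    return result

# The arrays from the data segment in the question.
print(sigma([-4, 22, 144], [-3, -16, 12]))  # prints 104
```

Note that neither array is modified, which matches the constraint in the assignment.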
Let's say I have two pointers to two structs a and b. These two structs each contain an enum, which we'll call x. Given any possible a and b, I want to call a specific function based on the values of their x enums.
What is interesting in my case is that the functions that I want to call look like:
X0_to_X1();
X0_to_X2();
...
X1_to_X0();
...
etc
where X0, X1 etc. are possible values of the enum x, meaning that there is an X_to_Y function for each combination of the values of the x enum.
The obvious "naive" solution to this would be a switch statement which would be quite big (given that x has quite a few possible values):
switch (a->x) {
case X0:
switch (b->x) {
case X1:
X0_to_X1();
break;
// ... and so on and so forth for every possible pair!
My first attempt at solving this a bit more elegantly was to implement a macro that, given two values of x, could form a function call:
#define CALL_FUNCTION(x1, x2) x1 ## _to_ ## x2 ()
This however does not work, as in my code I never can know the actual values of x before runtime, so it ends up looking like:
CALL_FUNCTION(a->x, b->x);
which of course gets converted to:
a->x_to_b->x();
which makes absolutely no sense.
Is there a way to solve this problem more elegantly, or should I just bite the bullet and implement the enormous switch statement instead?
This problem screams for a lookup table, where you store pointers to the various functions and they are keyed by the enumeration values.
If your enum values are sequential (and no two enumeration constants share the same value), then you can build a lookup table as a simple 2D array:
enum x_t { X0, X1, X2, ..., NUM_X };
void (*lookup[NUM_X][NUM_X])(void) = {
{ NULL, X0_to_X1, X0_to_X2, X0_to_X3, ... },
{ X1_to_X0, NULL, X1_to_X2, X1_to_X3, ... },
{ X2_to_X0, X2_to_X1, NULL, X2_to_X3, ... },
...
};
That assumes you don’t have an "identity" function when your x and y are the same.
Then, with x = a->x and y = b->x, you call the desired function by indexing into the table like so:
if ( x != y )
lookup[x][y]();
No, it isn’t pretty, but it beats nested switch statements. You can hide that behind a macro or another function call if you wish.
If your enumeration values aren’t sequential then this particular implementation won’t work - you’d have to build your lookup table a different way, using lists or sparse matrices. But while the setup code may be tedious, it will greatly simplify the logic on the calling side.
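For non-sequential values, the idea of a sparse table keyed by the pair of enum values can be sketched as follows. This is Python rather than C, purely to show the shape of the data structure; the enum values and the `x0_to_x1`-style functions are hypothetical stand-ins for the real ones. In C the same role could be played by a sorted array of (x, y, function pointer) entries or a hash table.

```python
from enum import Enum

class X(Enum):
    # Deliberately non-sequential values (hypothetical).
    X0 = 10
    X1 = 200
    X2 = 3000

def x0_to_x1():
    return "X0 -> X1"

def x1_to_x0():
    return "X1 -> X0"

def x0_to_x2():
    return "X0 -> X2"

# Sparse table: only the pairs that actually have a handler are listed.
LOOKUP = {
    (X.X0, X.X1): x0_to_x1,
    (X.X1, X.X0): x1_to_x0,
    (X.X0, X.X2): x0_to_x2,
}

def dispatch(a_x, b_x):
    """Call the handler for (a_x, b_x); return None if there is none."""
    handler = LOOKUP.get((a_x, b_x))
    return handler() if handler is not None else None

print(dispatch(X.X0, X.X1))
```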
I have a simple task of expanding the string FX according to the following rules:
X -> X+YF+
Y -> -FX-Y
In OpenCL, string manipulation is not supported but the use of an array of characters is. What would a kernel program that expands this string in parallel look like in OpenCL?
More details:
Consider the expansion of 'FX' in the python code below.
axiom = "FX"

def expand(s):
    switch = {
        "X": "X+YF+",
        "Y": "-FX-Y",
    }
    return switch.get(s, s)

def expand_once(string):
    return [expand(c) for c in string]

def expand_n(s, n):
    for i in range(n):
        s = ''.join(expand_once(s))
    return s

expanded = expand_n(axiom, 200)
The result expanded will be a result of expanding the axiom 'FX' 200 times. This is a rather slow process thus the need to do it on openCL for parallelization.
This process results in an array of strings which I will then use to draw a dragon curve.
Below is an example of how I would come up with such a dragon curve. This part is not of much importance; the expansion on OpenCL is the crucial part.
import turtles
from PIL import Image

turtles.setposition(5000, 5000)
turtles.left(90) # Go up to start.
for c in expanded:
    if c == "F":
        turtles.forward(10)
    elif c == "-":
        turtles.left(90)
    elif c == "+":
        turtles.right(90)

# Write out the image.
im = Image.fromarray(turtles.canvas)
im.save("dragon_curve.jpg")
Recursive algorithms like this don't lend themselves well to GPU acceleration, especially as the data set changes its size on each iteration.
If you do really need to do this iteratively, the challenge is for each work-item to know where in the output string to place its result. One way to do this would be to assign work groups a specific substring of the input, and on every iteration, keep count of the total number of Xs and Ys in each workgroup-sized substring of the output. From this you can calculate how much that substring will expand in one iteration, and if you accumulate those values, you'll know the offset of the output of each substring expansion. Whether this is efficient is another question. :-)
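The offset bookkeeping described above can be sketched on the CPU like this. It is a simplification: it computes one expansion length per character rather than one count per work-group substring, and the function name `expand_with_offsets` is made up. The two loops marked "parallel" are the parts that would become kernels; the prefix sum in between is the step that needs a parallel scan.

```python
from itertools import accumulate

RULES = {"X": "X+YF+", "Y": "-FX-Y"}

def expand_with_offsets(s):
    # Step 1 (parallel per character): how many characters each input emits.
    lengths = [len(RULES.get(c, c)) for c in s]
    # Step 2: an exclusive prefix sum gives each character its output offset.
    offsets = [0] + list(accumulate(lengths))[:-1]
    out = [None] * sum(lengths)
    # Step 3 (parallel per character): each replacement is written
    # independently, because its destination is already known.
    for i, c in enumerate(s):
        rep = RULES.get(c, c)
        out[offsets[i]:offsets[i] + len(rep)] = rep
    return "".join(out)

print(expand_with_offsets("FX"))  # prints FX+YF+
```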
However, your algorithm is actually fairly predictable: you can calculate precisely how large the final string will be given the initial string and number of iterations. The best way to generate this string with OpenCL would be to come up with a non-recursive function which analytically calculates the character at position N given M iterations, and then call that function once per work-item, with the (known!) final length of the string as the work size. I don't know if it's possible to come up with such a function, but it seems like it might be, and if it is possible, this is probably the most efficient way to do it on a GPU.
It seems like this might be possible: as far as I can tell, the result will be highly periodic:
FX
FX+YF+
FX+YF++-FX-YF+
FX+YF++-FX-YF++-FX+YF+--FX-YF+
FX+YF++-FX-YF++-FX+YF+--FX-YF++-FX+YF++-FX-YF+--FX+YF+--FX-YF+
^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^
A*     B       A       B       A       B       A       B
As far as I can see, those A blocks are all identical, and so are the Bs. (apart from the first A which is effectively at position -1) You can therefore determine the characters at 14 positions out of every 16 completely deterministically. I strongly suspect it's possible to work out the pattern of +s and -s that connects them too. If you figure that out, the solution becomes pretty easy.
Note though that when you have that function, you probably don't even need to put the result in a giant string: you can just feed your drawing algorithm with that function directly.
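Here is a sketch of what such a position function might look like, under the following reasoning. Expanding block A ("-FX+YF+") by hand gives A, "+", B, and expanding B ("-FX-YF+") gives A, "-", B, while the connecting signs are left untouched. That makes the connectors follow the regular paper-folding sequence: take the connector's 1-based index, strip its trailing zero bits, and the sign is "+" if what is left is 1 modulo 4. The names `char_at`, `expanded_length` and `expand_reference` are made up for illustration, and the sketch has only been checked against direct expansion for small iteration counts.

```python
BLOCK_A = "-FX+YF+"
BLOCK_B = "-FX-YF+"

def expanded_length(iterations):
    """Length of the string after the given number of expansions of 'FX'."""
    return 2 ** (iterations + 2) - 2

def char_at(pos):
    """Character at 0-based position `pos` of the expanded string.

    Each string is a prefix of the next one, so the result does not depend
    on the iteration count as long as pos < expanded_length(iterations).
    """
    q = pos + 1                      # shift: the first A starts at position -1
    block, offset = divmod(q, 8)     # 7 block characters + 1 connector
    if offset < 7:
        return (BLOCK_A if block % 2 == 0 else BLOCK_B)[offset]
    k = block + 1                    # 1-based index of this connector
    odd_part = k // (k & -k)         # strip trailing zero bits
    return "+" if odd_part % 4 == 1 else "-"

def expand_reference(iterations):
    """Direct expansion, used only to cross-check char_at."""
    rules = {"X": "X+YF+", "Y": "-FX-Y"}
    s = "FX"
    for _ in range(iterations):
        s = "".join(rules.get(c, c) for c in s)
    return s

n = expanded_length(3)
print("".join(char_at(p) for p in range(n)) == expand_reference(3))
```

In an OpenCL kernel, each work-item would evaluate the equivalent of `char_at(get_global_id(0))`, with the global work size set to the known final length.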
I am working on a program which generates C code for one function. This generated C function resides in the central loop of another target program; this function is performance sensitive. The generated function is used to call another function, based on a bool value -- this boolean value is fetched using 2 ints passed to the generated function: a state number and a mode number. Generated function looks like so:
void dispatch(System* system, int state, int mode) {
// Some other code here...
if (truthTable[state][mode]) {
doExpensiveCall(system, state, mode);
}
}
Some facts:
The ranges of 'state' and 'mode' values start at 0 and end at some number < 10,000. Their possible values are sequential, with no gaps in between. So, for example, if the end value of 'state' is 1000, then we know that there are 1001 states (including state 0).
The code generator is aware of the states and modes, and it knows ahead of time which combinations of state+mode will yield a value of true. Theoretically, any combination of state+mode could yield a true value, and thus make a call to doExpensiveCall, but in practice it will mostly be a handful of state+mode combinations that yield true. Again, this info is known during code generation.
Since this function will be called a lot, I want to optimize the check for the truth value. In the average case, I expect the test to yield false the vast majority of the time. On average, I expect that less than 1% of the calls will yield a value of true. But, theoretically, it could be as high as 100% of the time (this point depends on the end-user).
I am exploring the different ways that I could compute whether a state+mode will yield a call to doExpensiveCall(). In the end, I'll have to choose something, so I'm exploring my options now. These are the different ways that I could think of so far:
1) Create a precomputed two-dimensional array of booleans. This is what I'm using in the example above. This yields the fastest possible check that I can think of. The problem is that if state and mode have large ranges (say 10,000x1000), the generated table becomes very big (in the case of 10,000x1000, that's 10MB for just that table). Example:
// STATE_COUNT=4, MODE_COUNT=3
static const char truthTable[STATE_COUNT][MODE_COUNT] = {
    {0,1,0},
    {0,0,0},
    {1,1,0},
    {0,0,1}
};
2) Create a table like #1, but compressed: instead of each array entry being a single boolean, it would be a char bitfield holding 8 booleans. Then, during the check, I would do some computation with state+mode to decide how to index into the array. This reduces the size of the precomputed table by a factor of 8. The downside is that the reduction is not that large, and there is now a need to compute the index of the boolean in the bitfield table, instead of the simple array access in #1.
3) Since the amount of state+mode combinations that will yield a value of true is expected to be small, a switch statement is also possible (using the truthTable in #1 as reference):
switch(state){
    case 0: // row
        switch(mode){ // col
            case 1: doExpensiveCall(system, state, mode);
                break;
        }
        break;
    case 2:
        switch(mode){
            case 0:
            case 1: doExpensiveCall(system, state, mode);
                break;
        }
        break;
    case 3:
        switch(mode){
            case 2: doExpensiveCall(system, state, mode);
                break;
        }
        break;
}
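For option 2, the index computation is small. A rough sketch of what the code generator could do (written in Python here; `pack_bits` and `lookup` are made-up names, and the generated C would use the same shift-and-mask arithmetic on an unsigned char array):

```python
def pack_bits(table):
    """Pack a 2D list of 0/1 flags row-major, 8 flags per byte."""
    mode_count = len(table[0])
    bits = bytearray((len(table) * mode_count + 7) // 8)
    for s, row in enumerate(table):
        for m, flag in enumerate(row):
            if flag:
                i = s * mode_count + m       # flat bit index
                bits[i >> 3] |= 1 << (i & 7)  # byte i/8, bit i%8
    return bytes(bits), mode_count

def lookup(bits, mode_count, state, mode):
    """The check the generated function would perform."""
    i = state * mode_count + mode
    return bool((bits[i >> 3] >> (i & 7)) & 1)

# The STATE_COUNT=4, MODE_COUNT=3 example table from option 1.
truth_table = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
bits, mode_count = pack_bits(truth_table)
print(len(bits), lookup(bits, mode_count, 2, 1))
```

The extra cost per check is one multiply-add, a shift, and a mask, in exchange for a table that is 8 times smaller and therefore more likely to stay in cache.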
QUESTION:
What other ways, given the facts above, can be used to calculate the boolean value needed to call doExpensiveCall()?
Thanks
Edit:
I thought about Jens' sample code, and the following occurred to me. In order to have just one switch statement, I can do this computation in the generated code:
// #if STATE_COUNT > MODE_COUNT
int i = s * STATE_COUNT + m;
// #else
int i = m * MODE_COUNT + s;
// #endif
switch(i) {
    case 1: // use computed values here, too.
    case 8:
    case 9:
    case 14:
        doExpensiveCall(system, s, m);
}
I'd try to use a modified version of (3), where you actually have only one call, and all the switch/case stuff leads to that call. That way you can ensure that the compiler will apply whatever heuristics it has for optimizing this.
Something along the lines of
switch(state) {
    default: return;
    case 0: // row
        switch(mode){ // col
            default: return;
            case 1: break;
        }
        break;
    case 2:
        switch(mode){
            default: return;
            case 0: break;
            case 1: break;
        }
        break;
    case 3:
        switch(mode){
            default: return;
            case 2: break;
        }
        break;
}
doExpensiveCall(system, state, mode);
That is, you'd only have "control" inside the switch. The compiler should be able to sort this out nicely.
These heuristics will probably differ between architectures and compilation options (e.g. -O3 versus -Os), but this is what compilers are for: making choices based on platform-specific knowledge.
As for time efficiency, if your function call is really as expensive as you claim, this part will just be buried in the noise, so don't worry about it. (Or otherwise benchmark your code to be sure.)
If the code generator knows the percentage of the table that's in use it can choose the algorithm at build time.
So if it is about 50% true/false use the 10 MB table.
Otherwise use a hash table or a radix tree.
A hash table would choose a hash function and a number of buckets. You'd compute the hash, find the bucket and search the chain for the true (or false) values.
The radix tree would choose a radix (like 10) and you'd have 10 entries with pointers to NULL (no true values down there) and one would have a pointer to another 10 entries, until you finally reach a value.
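A minimal sketch of the hash-table variant, assuming a simple modulo hash over the flattened key state * MODE_COUNT + mode and chained buckets. The names `build_buckets` and `should_call` and the bucket count are illustrative choices; the code generator would emit the resulting buckets as static arrays in the generated C.

```python
def build_buckets(true_pairs, mode_count, bucket_count):
    """Done at code-generation time: place every true (state, mode) in a bucket."""
    buckets = [[] for _ in range(bucket_count)]
    for state, mode in true_pairs:
        key = state * mode_count + mode
        buckets[key % bucket_count].append(key)
    return buckets

def should_call(buckets, mode_count, state, mode):
    """Done at run time: hash, find the bucket, search its (short) chain."""
    key = state * mode_count + mode
    return key in buckets[key % len(buckets)]

# The true entries of the 4x3 example table from the question.
pairs = [(0, 1), (2, 0), (2, 1), (3, 2)]
buckets = build_buckets(pairs, mode_count=3, bucket_count=5)
print(should_call(buckets, 3, 2, 1), should_call(buckets, 3, 1, 1))
```

Since the generator knows all keys ahead of time, it can try several bucket counts and keep one that leaves every chain very short.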
I am familiar with iterative methods on paper, but MATLAB coding is relatively new to me and I cannot seem to find a way to code this.
In code language...
This is essentially what I have:
A = { [1;1] [2;1] [3;1] ... [33;1]
[1;2] [2;2] [3;2] ... [33;2]
... ... ... ... ....
[1;29] [2;29] [3;29] ... [33;29] }
... a 29x33 cell array of 2x1 column vectors, which I got from:
[X,Y] = meshgrid([1:33],[1:29])
A = squeeze(num2cell(permute(cat(3,X,Y),[3,1,2]),1))
[ Thanks to members of stackOverflow who helped me do this ]
I have a function that calls each of these column vectors and returns a single value. I want to institute a 2-D 5-point stencil method that evaluates a column vector and its 4 neighbors and finds the maximum value attained through the function out of those 5 column vectors.
i.e. if I was starting from the middle, the points evaluated would be :
1.
A{15,17}(1)
A{15,17}(2)
2.
A{14,17}(1)
A{14,17}(2)
3.
A{15,16}(1)
A{15,16}(2)
4.
A{16,17}(1)
A{16,17}(2)
5.
A{15,18}(1)
A{15,18}(2)
Out of these 5 points, the method would choose the one with the largest returned value from the function, move to that point, and rerun the method. This would continue on until a global maximum is reached. It's basically an iterative optimization method (albeit a primitive one). Note: I don't have access to the optimization toolbox.
Thanks a lot guys.
EDIT: sorry I didn't read the iterative part of your Q properly. Maybe someone else wants to use this as a template for a real answer, I'm too busy to do so now.
One solution using for loops (there might be a more elegant one):
overallmax=0;
for v=2:size(A,1)-1
    for w=2:size(A,2)-1
        % temp is the vertical arm of the "plus" stencil (rows v-1..v+1 of column w)
        temp=A((v-1):(v+1),w);
        tmpmax=max(cat(1,temp{:}));
        % temp2 is the horizontal arm of the "plus" stencil (columns w-1..w+1 of row v)
        temp2=A(v,(w-1):(w+1));
        tmpmax2=max(cat(1,temp2{:}));
        mxmx=max(tmpmax,tmpmax2);
        if mxmx>overallmax
            overallmax=mxmx;
        end
    end
end
But if you're just looking for max value, this is equivalent to:
overallmax=max(cat(1,A{:}));
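The iterative part of the question, which the loops above do not cover, might be sketched as follows. This is Python rather than MATLAB, the objective `f(x, y)` is a hypothetical stand-in for the asker's function, and the grid uses the same 1-based 29x33 layout where the cell at (row, col) holds the vector [col; row]. Note that a walk like this stops at the first point that beats all four neighbors, which is a local maximum and only the global one if the function has a single peak.

```python
def hill_climb(f, start, n_rows=29, n_cols=33):
    """Move to the best of the 5 stencil points until the center is the best."""
    row, col = start
    while True:
        stencil = [(row, col), (row - 1, col), (row + 1, col),
                   (row, col - 1), (row, col + 1)]
        # Drop neighbors that fall off the grid.
        stencil = [(r, c) for r, c in stencil
                   if 1 <= r <= n_rows and 1 <= c <= n_cols]
        # max() keeps the first of equal candidates, so ties favor staying put.
        best = max(stencil, key=lambda rc: f(rc[1], rc[0]))
        if best == (row, col):
            return best, f(col, row)
        row, col = best

# Hypothetical objective with a single peak at x = 20, y = 10.
def f(x, y):
    return -((x - 20) ** 2 + (y - 10) ** 2)

print(hill_climb(f, start=(15, 17)))  # prints ((10, 20), 0)
```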