For an assignment in a parallel computing class, we have to program Sparse Binary Matrix-Matrix multiplication (SpGEMM) in C. Julia has a relatively easy-to-follow implementation based on Gustavson's algorithm that works great.
The thing is, we also need to do the multiplication in block form, which I already did, but I don't really see any speedup from doing so. From what I understand, you're supposed to use
the result of A(i,k)*B(k,j), where (i,j) are coordinates in the block matrix, as a mask/filter for the next block multiplication in the sum C(i,j) = Σ( A(i,k)*B(k,j) ).
The Julia implementation that I followed, though, already keeps a dense boolean array while computing each row, which acts as a "flag" for when not to add something again to the resulting matrix.
My question is: is there any merit in turning this into block matrix multiplication, or is there something I might be doing wrong myself?
Keep in mind that my C code currently runs in half the time MATLAB takes to multiply a 5,000,000 x 5,000,000 sparse matrix. The blocked version, which I really tried to optimize and which also follows the Gustavson ordering, gets slower and slower as the block size decreases.
Here is my current code:
//C = D + (A*B) (basically OR)
int SpGEMM_dor(int *Acol, int *Arow, int An,
               int *Bcol, int *Brow, int Bm,
               int **Ccol, int *Crow, int *Csize, //output
               int *Dcol, int *Drow)              //previous
{
    //printCSR(Arow, Acol, An, An, An);
    int nnzcum = 0;
    bool *xb = calloc(An, sizeof(bool)); //boolean flag
    for(int i=0; i<An; i++){
        int nnzpv = nnzcum; //nnz of previous row
        Crow[i] = nnzcum;
        if(nnzcum + An > *Csize){ //make sure there's enough space
            *Csize += MAX(An, *Csize/4);
            *Ccol = realloc(*Ccol, *Csize*sizeof(int));
        }
        //---OR---
        //add previous row's items so they carry over into the next block
        for(int jj=Drow[i]; jj<Drow[i+1]; jj++){
            int j = Dcol[jj];
            xb[j] = true;
            (*Ccol)[nnzcum] = j;
            nnzcum++;
        }
        //--------
        //add new row items
        for(int jj=Arow[i]; jj<Arow[i+1]; jj++){
            int j = Acol[jj];
            for(int kp=Brow[j]; kp<Brow[j+1]; kp++){
                int k = Bcol[kp];
                if(!xb[k]){
                    xb[k] = true;
                    (*Ccol)[nnzcum] = k;
                    nnzcum++;
                }
            }
        }
        if(nnzcum > nnzpv){
            quickSort(*Ccol, nnzpv, nnzcum-1);
            for(int p=nnzpv; p<nnzcum; p++){
                xb[ (*Ccol)[p] ] = false;
            }
        }
    }
    Crow[An] = nnzcum;
    free(xb);
    return Crow[An]; //total nnz of C
}
The code inside the ----OR---- section only runs in the block version, in order to add the previous block to the one currently being calculated; it effectively computes C = D + (A*B). I've also tried calculating the next block and then merging the two sorted arrays of each row of the two CSR matrices, which turned out to be slower. All matrices are in CSR format.
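For reference, the merge variant was shaped roughly like this (a simplified sketch, not the exact code):

/* Union of two sorted column-index rows:
   d = row of D, t = row of the A*B block product, out = merged row of C. */
int merge_rows(const int *d, int dn, const int *t, int tn, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < dn && j < tn) {
        if (d[i] < t[j])      out[n++] = d[i++];
        else if (d[i] > t[j]) out[n++] = t[j++];
        else { out[n++] = d[i++]; j++; } /* present in both rows: emit once */
    }
    while (i < dn) out[n++] = d[i++];
    while (j < tn) out[n++] = t[j++];
    return n; /* nnz of the merged row */
}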
I'm currently trying to optimize matrix operations with intrinsics and loop unrolling. I'm hitting a segmentation fault that I couldn't figure out. Here is the code I changed:
const int UNROLL = 4;

void outer_product(matrix *vec1, matrix *vec2, matrix *dst) {
    assert(vec1->dim.cols == 1 && vec2->dim.cols == 1 &&
           vec1->dim.rows == dst->dim.rows && vec2->dim.rows == dst->dim.cols);
    __m256 tmp[4];
    for (int x = 0; x < UNROLL; x++) {
        tmp[x] = _mm256_setzero_ps();
    }
    for (int i = 0; i < vec1->dim.rows; i += UNROLL*8) {
        for (int j = 0; j < vec2->dim.rows; j++) {
            __m256 row2 = _mm256_broadcast_ss(&vec2->data[j][0]);
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_mul_ps(_mm256_load_ps(&vec1->data[i+x*8][0]), row2);
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }
        }
    }
}
void matrix_multiply(matrix *mat1, matrix *mat2, matrix *dst) {
    assert(mat1->dim.cols == mat2->dim.rows &&
           dst->dim.rows == mat1->dim.rows && dst->dim.cols == mat2->dim.cols);
    for (int i = 0; i < mat1->dim.rows; i += UNROLL*8) {
        for (int j = 0; j < mat2->dim.cols; j++) {
            __m256 tmp[4];
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_setzero_ps();
            }
            for (int k = 0; k < mat1->dim.cols; k++) {
                __m256 mat2_s = _mm256_broadcast_ss(&mat2->data[k][j]);
                for (int x = 0; x < UNROLL; x++) {
                    tmp[x] = _mm256_add_ps(tmp[x], _mm256_mul_ps(_mm256_load_ps(&mat1->data[i+x*8][k]), mat2_s));
                }
            }
            for (int x = 0; x < UNROLL; x++) {
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }
        }
    }
}
edited:
Here is the matrix struct. I didn't modify it.
typedef struct shape {
    int rows;
    int cols;
} shape;

typedef struct matrix {
    shape dim;
    float** data;
} matrix;
edited:
I used gdb to figure out which line caused the segmentation fault, and it looked like it was _mm256_load_ps(). Am I indexing into the matrix in a wrong way, such that it cannot load from the correct address? Or is it a problem of memory alignment?
In at least one place, you're doing 32-byte alignment-required loads with a stride of only 4 bytes. I think that's not what you actually meant to do, though:
for (int k = 0; k < mat1->dim.cols; k++) {
    for (int x = 0; x < UNROLL; x++) {
        ...
        _mm256_load_ps(&mat1->data[i+x*8][k])
    }
}
_mm256_load_ps loads 8 contiguous floats, i.e. it loads data[i+x*8][k] to data[i+x*8][k+7]. I think you want data[i+x][k*8], and loop over k in the inner-most loop.
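A rough sketch of that corrected pattern (assuming each row is contiguous and the column count is a multiple of 8; the unaligned load used here is explained just below):

for (int x = 0; x < UNROLL; x++) {
    for (int k = 0; k < mat1->dim.cols / 8; k++) {             /* k innermost */
        __m256 v = _mm256_loadu_ps(&mat1->data[i + x][k * 8]); /* 8 floats from one row */
        /* ... use v ... */
    }
}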
If you need unaligned loads / stores, use _mm256_loadu_ps / _mm256_storeu_ps. But prefer aligning your data to 32B, and pad the storage layout of your matrix so the row stride is a multiple of 32 bytes. (The actual logical dimensions of the array don't have to match the stride; it's fine to leave padding at the end of each row out to a multiple of 16 or 32 bytes. This makes loops much easier to write.)
You're not even using a 2D array (you're using an array of pointers to arrays of float), but the syntax looks the same as for float A[100][100], even though the meaning in asm is very different. Anyway, in Fortran, 2D indexing goes the other way: incrementing the left-most index takes you to the next position in memory. But in C, varying the left index by one takes you to a whole new row (pointed to by a different element of float **data, or, in a proper 2D array, one row stride away). Of course you're striding by 8 rows because of this mixup, combined with using x*8.
Speaking of the asm, you get really bad results for this code, especially with gcc, where it reloads 4 things for every vector, I think because it can't be sure the vector stores don't alias the pointer data. Assign things to local variables to make sure the compiler can hoist them out of loops (e.g. float **mat1dat = mat1->data;). Clang does slightly better, but the access pattern in the source is inherently bad and requires pointer-chasing in each inner-loop iteration to get to a new row, because you loop over x instead of k. I put it up on the Godbolt compiler explorer.
But really you should optimize the memory layout first, before trying to manually vectorize. It might be worth transposing one of the arrays, so you can loop over contiguous memory for rows of one matrix and columns of the other while doing the dot product of a row and column to calculate one element of the result. Or it could be worth doing c[i][j] += a[i][k] * b[k][j] in an inner loop over j, instead of transposing up front (but that's a lot of memory traffic). Whatever you do, make sure you're not striding through non-contiguous accesses to one of your matrices in the inner loop.
You'll also want to ditch the array-of-pointers thing and do manual 2D indexing (data[row * row_stride + col]) so your data is all in one contiguous block, instead of having each row allocated separately. Making this change first, before you spend any time manually vectorizing, makes the most sense.
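A minimal sketch of such a layout (the fmatrix type and the names here are just for illustration):

#include <stdlib.h>

typedef struct {
    int rows, cols;
    int stride;   /* padded row length in floats; a multiple of 8 keeps rows 32B-aligned */
    float *data;  /* one contiguous, 32-byte-aligned block of rows*stride floats */
} fmatrix;

fmatrix fmatrix_alloc(int rows, int cols) {
    fmatrix m;
    m.rows = rows;
    m.cols = cols;
    m.stride = (cols + 7) & ~7; /* round up to a multiple of 8 floats = 32 bytes */
    m.data = aligned_alloc(32, (size_t)rows * m.stride * sizeof(float));
    return m;
}

/* element access: m.data[r * m.stride + c] */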
gcc or clang with -O3 should do a not-terrible job of auto-vectorizing scalar C, especially if you compile with -ffast-math. (You might remove -ffast-math after you're done manually vectorizing, but use it while tuning with auto-vectorization).
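For example (the file name is a placeholder):

gcc -O3 -march=native -ffast-math matmul.c -o matmul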
Related:
How does BLAS get such extreme performance?
Also see my comments on Poor maths performance in C vs Python/numpy for another bad-memory-layout problem.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
You might manually vectorize before or after looking at cache-blocking, but when you do, see Matrix Multiplication with blocks.
Goal: implement the diagram described below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and the temp array, and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, values are passed into the temp array, which is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel, and the values of each index are then accumulated into one single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a way better solution for what I am attempting. Here is my code below.
Here are some things that I know are wrong with the code for sure. When I try to multiply the coefficient values with the temp array, it crashes; this is because of the global_id. All I want that line to do is simply multiply the two arrays in parallel.
I tried to figure out why the FIFO function was taking so long, so I started commenting lines out. I first commented out everything except the first for loop of the FIFO function; as a result it took 50 ms. Then, when I uncommented the next loop, it jumped to 8000 ms. So the delay must have to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for arrays? (I know there is a '>>' operator for integers.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
    //take an array of 58 elements (or same size as the number of coefficients)
    //shift all elements to the right by one
    //bring the next element from the input into index 0
    //multiply the coefficient array with the same-sized array and accumulate
    //store into one value of the output array
    //repeat until the input array has reached the end
    int globalId = get_global_id(0);

    float output = 0.0f;

    //Shift everything down from 1 to 57
    //takes about 50ms here
    for(int i=1; i<58; i++){
        tempArrayForShift[i] = temp[i];
    }

    //Input the new value passed from the main kernel. The rest of the values were
    //shifted over, so the element is written at index 0.
    tempArrayForShift[0] = inputValue;

    //Takes about 8000ms with this loop included
    //Write values back into the temp array
    for(int i=0; i<58; i++){
        temp[i] = tempArrayForShift[i];
    }

    //all 58 elements of the coefficient array and temp array are multiplied
    //at the same time and stored in a new array
    //I am 100% sure this line is crashing the program.
    //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

    //Sum the temp array with each other. The temp array consists of coefficients*fifo buffer
    for (int i = 0; i < 58; i ++) {
        // output = multipliedResult[i] + output;
    }

    //Return the summed value of the temp array
    return output;
}

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    //Initialize the temporary array values to 0
    for (int i = 0; i < 58; i ++) {
        temp[i] = 0;
        tempArrayForShift[i] = 0;
        multipliedResult[i] = 0;
    }

    //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
    for (int i = 0; i < 60; i ++) {
        Output[i] = fifo(Array[i], coefficients, 58);
    }
}
I have had this problem with OpenCL for a long time: I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about:
In the main cpp file, I could implement the FIFO buffer there and have the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would that be the better solution, or would it just be completely inefficient?
To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are just using a single thread, and a GPU is very slow per thread but can be very fast if many threads are running at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered, so you can just grab those values from memory, multiply them with the coefficients, and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, since the values are reused several times. The operation you are trying to perform is called a convolution (1D). NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version; you could adapt their convolutionRows kernel to a 1D convolution.
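On the host side, launching one work item per output value would look roughly like this (buffer and queue names are placeholders, error checking omitted):

size_t globalSize = numOutputs; /* e.g. the 1.2M output values */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &arrayBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &coeffBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputBuf);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);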
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but it should perform fairly well. The 65-item work group is not very optimal.
the steps:
1. init local values to 0
2. copy coefficients to a local variable
then, looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy the new element in (work item 0 only)
5. compute the dot product:
5a. multiplication - one per work item
5b. reduction loop to compute the sum
6. copy the dot product to output (WI 0 only)
7. final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output,
                      __local float *localArray, __local float *localSums, int outputSize){
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1 init local values to 0
    localArray[localId] = 0.0f;

    //2 copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp = 0.0f;
    for(int i = 0; i < outputSize; i++)
    {
        //3 shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4 copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5 compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);

        //5b reduction loop + barrier
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            if ((localId & mask) == 0) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }
}
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M-element Output. To allow multiple work groups, you could use outputSize / get_num_groups(0) to calculate the amount of work each group should do (call it workAmount), and adjust the i for-loop:
for (int i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values (groupId = get_group_id(0))
if(groupId == 0){
    localArray[localId] = 0.0f;
}else{
    localArray[localSize - localId] = Array[workAmount - localId];
}
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way to implement this with fewer barriers, but I still feel this is optimal for the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2 or a multiple of 64. This won't be a huge issue, though, because the many barriers in the kernel make all the work items wait fairly regularly.
I'm building a suite of functions to work with a multidimensional-array data structure and I want to be able to define arbitrary slices of the arrays so I can implement a generalized inner product of two arbitrary matrices (aka Tensors or n-d arrays).
An APL paper I read (I honestly can't find which one; I've read so many) defines the matrix product on left-matrix X with dimensions A;B;C;D;E;F and right-matrix Y with dimensions G;H;I;J;K, where F==G, as
Z <- X +.× Y
Z[A;B;C;D;E;H;I;J;K] <- +/ X[A;B;C;D;E;*] × Y[*;H;I;J;K]
where +/ is sum of, and × applies element-by-element to two vectors of the same length.
So I need "row" slices of the left and "column" slices of the right. I can of course take a transpose and then a "row" slice to simulate a "column" slice, but I'd rather do it more elegantly.
Wikipedia's article on slicing leads to a stub about dope vectors which seem to be the miracle cure I'm looking for, but there's not a lot there to go on.
How do I use a dope vector to implement arbitrary slicing?
(Much later I did notice Stride of an array which has some details.)
Definition
General array slicing can be implemented (whether or not built into the language) by referencing every array through a dope vector or descriptor — a record that contains the address of the first array element, and then the range of each index and the corresponding coefficient in the indexing formula. This technique also allows immediate array transposition, index reversal, subsampling, etc. For languages like C, where the indices always start at zero, the dope vector of an array with d indices has at least 1 + 2d parameters.
http://en.wikipedia.org/wiki/Array_slicing#Details
That's a dense paragraph, but it's actually all in there. So we need a data structure like this:
struct {
    TYPE *data;  //address of first array element
    int rank;    //number of dimensions
    int *dims;   //size of each dimension
    int *weight; //corresponding coefficient in the indexing formula
};
Where TYPE is whatever the element type is, the field of the matrices. For simplicity and concreteness, we'll just use int. For my own purposes, I've devised an encoding of various types into integer handles so int does the job for me, YMMV.
typedef struct arr {
    int rank;
    int *dims;
    int *weight;
    int *data;
} *arr;
All of the pointer members can be assigned locations within the same allocated block of memory as the struct itself (which we will call the header). But by replacing the earlier use of offsets and struct-hackery, the implementation of algorithms can be made independent of that actual memory layout within (or without) the block.
The basic memory layout for a self-contained array object is:

rank dims weight data
     dims[0] dims[1] ... dims[rank-1]
     weight[0] weight[1] ... weight[rank-1]
     data[0] data[1] ... data[ product(dims)-1 ]
An indirect array sharing data (whole array, or 1 or more row-slices) will have the following memory layout:

rank dims weight data
     dims[0] dims[1] ... dims[rank-1]
     weight[0] weight[1] ... weight[rank-1]
     //no data! it's somewhere else!
And an indirect array containing an orthogonal slice along another axis will have the same layout as a basic indirect array, but with dims and weight suitably modified.
The access formula for an element with indices (i0 i1 ... iN) is

a->data[ i0*a->weight[0] + i1*a->weight[1] + ... + iN*a->weight[N] ]

assuming each index i[j] is in [ 0 ... dims[j] ).
In the weight vector for a normally laid-out row-major array, each element is the product of all lower dimensions.
for (int i=0; i<rank; i++)
    weight[i] = product(dims[i+1 .. rank-1]);
So for a 3×4×5 array, the metadata would be
{ .rank=3, .dims=(int[]){3,4,5}, .weight=(int[]){4*5, 5, 1} }
or for a 7×6×5×4×3×2 array, the metadata would be
{ .rank=6, .dims={7,6,5,4,3,2}, .weight={720, 120, 24, 6, 2, 1} }
Construction
So, to create one of these, we need the same helper function from the previous question to compute the size from a list of dimensions.
/* multiply together rank integers in dims array */
int productdims(int rank, int *dims){
    int i, z=1;
    for(i=0; i<rank; i++)
        z *= dims[i];
    return z;
}
To allocate, simply malloc enough memory and set the pointers to the appropriate places.
/* create array given rank and int[] dims */
arr arraya(int rank, int dims[]){
    int datasz;
    int i;
    int x;
    arr z;
    datasz = productdims(rank,dims);
    z = malloc(sizeof(struct arr)
             + (rank+rank+datasz)*sizeof(int));
    z->rank = rank;
    z->dims = (int*)(z + 1);
    z->weight = z->dims + rank;
    z->data = z->weight + rank;
    memmove(z->dims,dims,rank*sizeof(int));
    for(x=1, i=rank-1; i>=0; i--){
        z->weight[i] = x;
        x *= z->dims[i];
    }
    return z;
}
And using the same trick from the previous answer, we can make a variable-argument interface to make usage easier.
/* load rank integers from va_list into int[] dims */
void loaddimsv(int rank, int dims[], va_list ap){
    int i;
    for (i=0; i<rank; i++){
        dims[i] = va_arg(ap,int);
    }
}

/* create a new array with specified rank and dimensions */
arr (array)(int rank, ...){
    va_list ap;
    //int *dims=calloc(rank,sizeof(int));
    int dims[rank];
    arr z;
    va_start(ap,rank);
    loaddimsv(rank,dims,ap);
    va_end(ap);
    z = arraya(rank,dims);
    //free(dims);
    return z;
}
And even automatically produce the rank argument by counting the other arguments using the awesome powers of ppnarg.
#define array(...) (array)(PP_NARG(__VA_ARGS__),__VA_ARGS__) /* create a new array with specified dimensions */
Now constructing one of these is very easy.
arr a = array(2,3,4); // create a dynamic [2][3][4] array
Accessing elements
An element is retrieved with a function call to elema which multiplies each index by the corresponding weight, sums them, and indexes the data pointer. We return a pointer to the element so it can be read or modified by the caller.
/* access element of a indexed by int[] */
int *elema(arr a, int *ind){
    int idx = 0;
    int i;
    for (i=0; i<a->rank; i++){
        idx += ind[i] * a->weight[i];
    }
    return a->data + idx;
}
The same varargs trick can make an easier interface elem(a,i,j,k).
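The wrapper itself isn't shown in this answer; a sketch of what it could look like, following the same pattern as (array) above (the real version is in the github file linked below):

/* varargs element access: elem(a,i,j,k); the rank comes from the header */
int *elem(arr a, ...){
    va_list ap;
    int ind[a->rank];
    va_start(ap, a);
    loaddimsv(a->rank, ind, ap);
    va_end(ap);
    return elema(a, ind);
}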
Axial Slices
To take a slice, first we need a way of specifying which dimensions to extract and which to collapse. If we just need to select a single index or all elements from a dimension, then the slice function can take rank ints as arguments and interpret -1 as selecting the whole dimension, or a value in 0..dims[i]-1 as selecting a single index.
/* take a computed slice of a following spec[] instructions
   if spec[i] >= 0 and spec[i] < a->dims[i], then spec[i] selects
       that index from dimension i.
   if spec[i] == -1, then spec[i] selects the entire dimension i.
 */
arr slicea(arr a, int spec[]){
    int i,j;
    int rank;
    for (i=0,rank=0; i<a->rank; i++)
        rank += spec[i]==-1;
    int dims[rank];
    int weight[rank];
    for (i=0,j=0; i<rank; i++,j++){
        while (spec[j]!=-1) j++;
        if (j>=a->rank) break;
        dims[i] = a->dims[j];
        weight[i] = a->weight[j];
    }
    arr z = casta(a->data, rank, dims);
    memcpy(z->weight,weight,rank*sizeof(int));
    for (j=0; j<a->rank; j++){
        if (spec[j]!=-1)
            z->data += spec[j] * a->weight[j];
    }
    return z;
}
All the dimensions and weights corresponding to the -1s in the argument array are collected and used to create a new array header. All arguments >= 0 are multiplied by their associated weight and added to the data pointer, offsetting the pointer to the correct element.
In terms of the array access formula, we're treating it as a polynomial.
offset = constant + sum_i=0,n( weight[i] * index[i] )
So for any dimension from which we're selecting a single element (+ all lower dimensions), we factor-out the now-constant index and add it to the constant term in the formula (which in our C representation is the data pointer itself).
The helper function casta creates the new array header with shared data. slicea of course changes the weight values, but by calculating the weights itself, casta becomes a more generally usable function. It can even be used to construct a dynamic array structure that operates directly on a regular C-style multidimensional array, thus performing a cast.
/* create an array header to access existing data in multidimensional layout */
arr casta(int *data, int rank, int dims[]){
    int i,x;
    arr z = malloc(sizeof(struct arr)
                 + (rank+rank)*sizeof(int));
    z->rank = rank;
    z->dims = (int*)(z + 1);
    z->weight = z->dims + rank;
    z->data = data;
    memmove(z->dims,dims,rank*sizeof(int));
    for(x=1, i=rank-1; i>=0; i--){
        z->weight[i] = x;
        x *= z->dims[i];
    }
    return z;
}
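For instance, a header can be cast over an ordinary C array (hypothetical usage, just to illustrate):

int raw[3][4] = {{0,1,2,3},{4,5,6,7},{8,9,10,11}};
arr view = casta(&raw[0][0], 2, (int[]){3,4});
/* *elema(view, (int[]){1,2}) == raw[1][2] == 6 */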
Transposes
The dope vector can also be used to implement transposes. The order of the dimensions (and indices) can be changed.
Remember that this is not a normal 'transpose' like everybody else does. We don't rearrange the data at all; this is more properly called a 'dope-vector pseudo-transpose'. Instead of changing the data, we just change the constants in the access formula, rearranging the coefficients of the polynomial. In a real sense, this is just an application of the commutativity and associativity of addition.
So for concreteness, assume the data is arranged sequentially starting at hypothetical address 500.

500: 0
501: 1
502: 2
503: 3
504: 4
505: 5
506: 6
If a is rank 2, dims {1,7}, weight {7,1}, then the sum of the indices multiplied by the associated weights, added to the initial pointer (500), yields the appropriate address for each element:
a[0][0] == *(500+0*7+0*1)
a[0][1] == *(500+0*7+1*1)
a[0][2] == *(500+0*7+2*1)
a[0][3] == *(500+0*7+3*1)
a[0][4] == *(500+0*7+4*1)
a[0][5] == *(500+0*7+5*1)
a[0][6] == *(500+0*7+6*1)
So the dope-vector pseudo-transpose rearranges the weights and dimensions to match the new ordering of indices, but the sum remains the same. The initial pointer remains the same. The data does not move.
b[0][0] == *(500+0*1+0*7)
b[1][0] == *(500+1*1+0*7)
b[2][0] == *(500+2*1+0*7)
b[3][0] == *(500+3*1+0*7)
b[4][0] == *(500+4*1+0*7)
b[5][0] == *(500+5*1+0*7)
b[6][0] == *(500+6*1+0*7)
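As a concrete sketch of the idea for rank 2 (a hypothetical helper; the general n-D version would permute dims and weight by an arbitrary ordering):

/* dope-vector pseudo-transpose: share the data, swap dims and weights,
   so b[i][j] addresses the same element as a[j][i] */
arr transpose2(arr a){
    arr z = casta(a->data, 2, (int[]){a->dims[1], a->dims[0]});
    z->weight[0] = a->weight[1];
    z->weight[1] = a->weight[0];
    return z;
}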
Inner Product (aka Matrix Multiplication)
So, by using general slices or transpose+"row"-slices (which are easier), generalized inner product can be implemented.
First we need the two helper functions for applying a binary operation to two vectors producing a vector result, and reducing a vector with a binary operation producing a scalar result.
As in the previous question we'll pass in the operator, so the same function can be used with many different operators. For the style here, I'm passing operators as single characters, so there's already an indirect mapping from C operators to our function's operators. This is an x-macro table.
#define OPERATORS(_) \
/* f F id */ \
_('+',+,0) \
_('*',*,1) \
_('=',==,1) \
/**/
#define binop(X,F,Y) (binop)(X,*#F,Y)
arr (binop)(arr x, char f, arr y); /* perform binary operation F upon corresponding elements of vectors X and Y */
The extra element in the table is for the reduce function for the case of a null vector argument. In that case, reduce should return the operator's identity element, 0 for +, 1 for *.
#define reduce(F,X) (reduce)(*#F,X)
int (reduce)(char f, arr a); /* perform binary operation F upon adjacent elements of vector X, right to left,
reducing vector to a single value */
So the binop does a loop and a switch on the operator character.
/* perform binary operation F upon corresponding elements of vectors X and Y */
#define BINOP(f,F,id) case f: *elem(z,i) = *elem(x,i) F *elem(y,i); break;
arr (binop)(arr x, char f, arr y){
    arr z = copy(x);
    int n = x->dims[0];
    int i;
    for (i=0; i<n; i++){
        switch(f){
            OPERATORS(BINOP)
        }
    }
    return z;
}
#undef BINOP
And the reduce function presets the initial value to the operator's identity element; then, if there are any elements, it sets the value to the last element and does a backwards loop over the rest.
/* perform binary operation F upon adjacent elements of vector X, right to left,
   reducing vector to a single value */
#define REDID(f,F,id) case f: x = id; break;
#define REDOP(f,F,id) case f: x = *elem(a,i) F x; break;
int (reduce)(char f, arr a){
    int n = a->dims[0];
    int x;
    int i;
    switch(f){
        OPERATORS(REDID)
    }
    if (n) {
        x = *elem(a,n-1);
        for (i=n-2; i>=0; i--){
            switch(f){
                OPERATORS(REDOP)
            }
        }
    }
    return x;
}
#undef REDID
#undef REDOP
And with these tools, inner product can be implemented in a higher-level manner.
/* perform a (2D) matrix multiplication upon rows of x and columns of y
using operations F and G.
Z = X F.G Y
Z[i,j] = F/ X[i,*] G Y'[j,*]
more generally,
perform an inner product on arguments of compatible dimension.
Z = X[A;B;C;D;E;F] +.* Y[G;H;I;J;K] |(F = G)
Z[A;B;C;D;E;H;I;J;K] = +/ X[A;B;C;D;E;*] * Y[*;H;I;J;K]
*/
arr (matmul)(arr x, char f, char g, arr y){
    int i;
    arr xdims = cast(x->dims,1,x->rank);
    arr ydims = cast(y->dims,1,y->rank);
    xdims->dims[0]--;
    ydims->dims[0]--;
    ydims->data++;
    arr z = arraya(x->rank+y->rank-2, catv(xdims,ydims)->data);
    int datasz = productdims(z->rank,z->dims);
    arr xs = NULL;
    arr ys = NULL;
    for (i=0; i<datasz; i++){
        int idx[x->rank+y->rank];
        vector_index(i,z->dims,z->rank,idx);
        int *xdex = idx;
        int *ydex = idx + x->rank - 1;
        memmove(ydex+1, ydex, (y->rank-1)*sizeof(int));
        xdex[x->rank-1] = -1;
        free(xs);
        free(ys);
        xs = slicea(x,xdex);
        ys = slicea(y,ydex);
        z->data[i] = (reduce)(f,(binop)(xs,g,ys));
    }
    free(xs);
    free(ys);
    free(xdims);
    free(ydims);
    return z;
}
This function also uses the function cast, which presents a varargs interface to casta.
/* create an array header to access existing data in multidimensional layout */
arr cast(int *data, int rank, ...){
    va_list ap;
    int dims[rank];
    va_start(ap,rank);
    loaddimsv(rank,dims,ap);
    va_end(ap);
    return casta(data, rank, dims);
}
And it also uses vector_index to convert a 1D index into an nD vector of indices.
/* compute vector index list for ravel index ind */
int *vector_index(int ind, int *dims, int n, int *vec){
    int i, t=ind, *z=vec;
    for (i=0; i<n; i++){
        z[n-1-i] = t % dims[n-1-i];
        t /= dims[n-1-i];
    }
    return z;
}
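A quick sanity check of the indexing (dims {3,4,5}, so weights are {20,5,1}):

int vec[3];
vector_index(37, (int[]){3,4,5}, 3, vec);
/* vec is now {1,3,2}, since 37 == 1*20 + 3*5 + 2*1 */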
The complete code, along with additional supporting functions, is in the github file.
This Q/A pair is part of a series of related issues which arose in implementing inca, an interpreter for the APL language, written in C. Others: How can I work with dynamically-allocated arbitrary-dimensional arrays?, and How to pass a C math operator (+-*/%) into a function result=mathfunc(x,+,y);?. Some of this material was previously posted to comp.lang.c. More background in comp.lang.apl.
I'm working on a demo that requires a lot of vector math, and in profiling, I've found that it spends the most time finding the distances between given vectors.
Right now, it loops through an array of X^2 vectors and finds the distance between each pair, meaning it runs the distance function X^4 times, even though (I think) there are only (X^4)/2 unique distances.
It works something like this: (pseudo c)
#define MATRIX_WIDTH 8
typedef float vec2_t[2];
vec2_t matrix[MATRIX_WIDTH * MATRIX_WIDTH];
...
for(int i = 0; i < MATRIX_WIDTH; i++)
{
    for(int j = 0; j < MATRIX_WIDTH; j++)
    {
        float xd, yd;
        float distance;
        for(int k = 0; k < MATRIX_WIDTH; k++)
        {
            for(int l = 0; l < MATRIX_WIDTH; l++)
            {
                int index_a = (i * MATRIX_WIDTH) + j;
                int index_b = (k * MATRIX_WIDTH) + l;
                xd = matrix[index_a][0] - matrix[index_b][0];
                yd = matrix[index_a][1] - matrix[index_b][1];
                distance = sqrtf(powf(xd, 2) + powf(yd, 2));
            }
        }
        // More code that uses the distances between each vector
    }
}
What I'd like to do is create and populate an array of (X^4)/2 distances without redundancy, then reference that array when I finally need it. However, I'm drawing a blank on how to index this array in a way that would work. A hash table would do it, but I think it's much too complicated and slow for a problem that seems like it could be solved by a clever indexing method.
EDIT: This is for a flocking simulation.
performance ideas:
a) if possible, work with the squared distance, to avoid the root calculation
b) never use pow for constant, integer powers; instead use xd*xd
I would consider changing your algorithm - O(n^4) is really bad. When dealing with interactions in physics (also O(n^4) for distances in a 2d field), one would implement b-trees etc. and neglect particle interactions with low impact. But it will depend on what the "more code that uses the distances" really does.
I just did some considerations: the number of unique distances is 0.5*n*(n+1), with n = w*h.
If you write down when unique distances occur, you will see that both inner loops can be reduced, by starting at i and j.
Additionally, if you only need to access those distances via the matrix indices, you can set up a 4D distance matrix.
If memory is limited, we can save nearly 50%, as mentioned above, with a lookup function that accesses a triangular matrix, as Code-Guru said. We would probably precalculate the line indices to avoid summing up on access:
float distanceArray[(H*W+1)*H*W/2];
int lineIndices[H*W];

float searchDistance(int i, int j)
{
    return i<j ? distanceArray[i + lineIndices[j]]
               : distanceArray[j + lineIndices[i]];
}
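The precalculation would then be something like this (assuming row j of the triangular matrix holds the entries for i = 0..j, so line j starts at offset j*(j+1)/2):

for (int j = 0; j < H*W; j++)
    lineIndices[j] = j * (j + 1) / 2;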