matrix optimization - segmentation fault when using intrinsics and loop unrolling - c

I'm currently trying to optimize matrix operations with intrinsics and loop unrolling, but I'm hitting a segmentation fault that I can't figure out. Here is the code I changed:
const int UNROLL = 4;

void outer_product(matrix *vec1, matrix *vec2, matrix *dst) {
    assert(vec1->dim.cols == 1 && vec2->dim.cols == 1 &&
           vec1->dim.rows == dst->dim.rows && vec2->dim.rows == dst->dim.cols);
    __m256 tmp[4];
    for (int x = 0; x < UNROLL; x++) {
        tmp[x] = _mm256_setzero_ps();
    }
    for (int i = 0; i < vec1->dim.rows; i += UNROLL * 8) {
        for (int j = 0; j < vec2->dim.rows; j++) {
            __m256 row2 = _mm256_broadcast_ss(&vec2->data[j][0]);
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_mul_ps(_mm256_load_ps(&vec1->data[i + x * 8][0]), row2);
                _mm256_store_ps(&dst->data[i + x * 8][j], tmp[x]);
            }
        }
    }
}

void matrix_multiply(matrix *mat1, matrix *mat2, matrix *dst) {
    assert(mat1->dim.cols == mat2->dim.rows && dst->dim.rows == mat1->dim.rows &&
           dst->dim.cols == mat2->dim.cols);
    for (int i = 0; i < mat1->dim.rows; i += UNROLL * 8) {
        for (int j = 0; j < mat2->dim.cols; j++) {
            __m256 tmp[4];
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_setzero_ps();
            }
            for (int k = 0; k < mat1->dim.cols; k++) {
                __m256 mat2_s = _mm256_broadcast_ss(&mat2->data[k][j]);
                for (int x = 0; x < UNROLL; x++) {
                    tmp[x] = _mm256_add_ps(tmp[x],
                             _mm256_mul_ps(_mm256_load_ps(&mat1->data[i + x * 8][k]), mat2_s));
                }
            }
            for (int x = 0; x < UNROLL; x++) {
                _mm256_store_ps(&dst->data[i + x * 8][j], tmp[x]);
            }
        }
    }
}
Edit:
Here is the matrix struct. I didn't modify it.
typedef struct shape {
    int rows;
    int cols;
} shape;

typedef struct matrix {
    shape dim;
    float **data;
} matrix;
Edit:
I used gdb to figure out which line caused the segmentation fault, and it looks like it's _mm256_load_ps(). Am I indexing into the matrix in a wrong way, so that it can't load from the correct address? Or is it a problem of memory alignment?

In at least one place, you're doing 32-byte alignment-required loads with a stride of only 4 bytes. I think that's not what you actually meant to do, though:
for (int k = 0; k < mat1->dim.cols; k++) {
    for (int x = 0; x < UNROLL; x++) {
        ...
        _mm256_load_ps(&mat1->data[i + x * 8][k])
    }
}
_mm256_load_ps loads 8 contiguous floats, i.e. it loads data[i+x*8][k] to data[i+x*8][k+7]. I think you want data[i+x][k*8], and loop over k in the inner-most loop.
If you need unaligned loads / stores, use _mm256_loadu_ps / _mm256_storeu_ps. But prefer aligning your data to 32B, and pad the storage layout of your matrix so the row stride is a multiple of 32 bytes. (The actual logical dimensions of the array don't have to match the stride; it's fine to leave padding at the end of each row out to a multiple of 16 or 32 bytes. This makes loops much easier to write.)
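For example, a minimal sketch of such an aligned, padded allocation (using the single contiguous block layout also suggested below; assuming C11 aligned_alloc, and pad_cols / alloc_matrix_padded are illustrative names, not from the question):
#include <stdlib.h>

// Round the column count up to a multiple of 8 floats, so every row
// stride is a multiple of 32 bytes.
static size_t pad_cols(size_t cols) {
    return (cols + 7) & ~(size_t)7;
}

static float *alloc_matrix_padded(size_t rows, size_t cols) {
    size_t stride = pad_cols(cols);
    // aligned_alloc requires the size to be a multiple of the alignment;
    // stride is a multiple of 8 floats (32 bytes), so it is.
    return aligned_alloc(32, rows * stride * sizeof(float));
}
With this layout, element (r, c) lives at data[r * stride + c], and a _mm256_load_ps starting at any column that's a multiple of 8 is always 32B-aligned.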
You're not even using a 2D array (you're using an array of pointers to arrays of float), but the syntax looks the same as for float A[100][100], even though the meaning in asm is very different. Anyway, in Fortran 2D arrays the indexing goes the other way, where incrementing the left-most index takes you to the next position in memory. But in C, varying the left index by one takes you to a whole new row. (Pointed to by a different element of float **data, or in a proper 2D array, one row stride away.) Of course you're striding by 8 rows because of this mixup combined with using x*8.
Speaking of the asm, you get really bad results for this code, especially with gcc, where it reloads 4 things for every vector, I think because it's not sure the vector stores don't alias the pointer data. Assign things to local variables to make sure the compiler can hoist them out of loops (e.g. float **mat1dat = mat1->data;). Clang does slightly better, but the access pattern in the source is inherently bad and requires pointer-chasing for each inner-loop iteration to get to a new row, because you loop over x instead of k. I put it up on the Godbolt compiler explorer.
But really you should optimize the memory layout first, before trying to manually vectorize. It might be worth transposing one of the arrays, so you can loop over contiguous memory for rows of one matrix and columns of the other while doing the dot product of a row and column to calculate one element of the result. Or it could be worth doing c[Arow][Bcol] += a_value_from_A * b[Brow][Bcol] (with Brow equal to the A column you took the value from) inside an inner loop instead of transposing up front (but that's a lot of memory traffic). But whatever you do, make sure you're not striding through non-contiguous accesses to one of your matrices in the inner loop.
You'll also want to ditch the array-of-pointers thing and do manual 2D indexing (data[row * row_stride + col]) so your data is all in one contiguous block instead of having each row allocated separately. Making this change first, before you spend any time manually vectorizing, makes the most sense.
gcc or clang with -O3 should do a not-terrible job of auto-vectorizing scalar C, especially if you compile with -ffast-math. (You might remove -ffast-math after you're done manually vectorizing, but use it while tuning with auto-vectorization).
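As a sketch of what the scalar version over the flat, padded layout might look like (matmul_flat and stride are illustrative names; assumes c is zero-initialized), with a contiguous inner loop the auto-vectorizer handles well:
// i-k-j loop order: the inner loop walks contiguous rows of b and c,
// and restrict tells the compiler the three buffers don't alias.
void matmul_flat(const float *restrict a, const float *restrict b,
                 float *restrict c, int n, int stride) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            float aik = a[i * stride + k]; // hoisted; the compiler broadcasts it
            for (int j = 0; j < n; j++) {
                c[i * stride + j] += aik * b[k * stride + j];
            }
        }
    }
}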
Related:
How does BLAS get such extreme performance?
Also see my comments on Poor maths performance in C vs Python/numpy for another bad-memory-layout problem.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
You might manually vectorize before or after looking at cache-blocking, but when you do, see Matrix Multiplication with blocks.

Related

Is there a point in transforming Sparse Matrix Multiplication into block form?

In an assignment for a parallel computing class we have been assigned to program Sparse Binary Matrix-Matrix multiplication (SpGEMM) in C. Julia has a relatively easy to follow implementation based on Gustavson's algorithm that works great.
Thing is, we also need to do the multiplication in block form, which I already did, but I don't really see any speedup in doing so. From what I understand, you're supposed to use the result of A(i,k)*B(k,j), where (i,j) are coordinates in the block matrix, as a mask/filter for the next block multiplication in the sum C(i,j) = Σ( A(i,k)*B(k,j) ).
Julia's implementation, though, which I followed, already uses a dense boolean array while computing each row, acting as a "flag" so that nothing gets added twice to the resulting matrix.
My question is: is there any merit in turning this into block matrix multiplication, or is there something that I might be doing wrong myself?
Keep in mind my C code currently runs in half the time Matlab takes to multiply a 5,000,000 x 5,000,000 sparse matrix. The blocked version, which I really tried to optimize and which also follows the Gustavson order, gets slower and slower as the block size is made smaller.
Here is my current code:
// C = D + (A*B)  (basically OR)
bool SpGEMM_dor(int *Acol, int *Arow, int An,
                int *Bcol, int *Brow, int Bm,
                int **Ccol, int *Crow, int *Csize, // output
                int *Dcol, int *Drow)              // previous
{
    //printCSR(Arow, Acol, An, An, An);
    int nnzcum = 0;
    bool *xb = calloc(An, sizeof(bool)); // boolean flag
    for (int i = 0; i < An; i++) {
        int nnzpv = nnzcum; // nnz of previous row
        Crow[i] = nnzcum;
        if (nnzcum + An > *Csize) { // make sure there's enough space
            *Csize += MAX(An, *Csize / 4);
            *Ccol = realloc(*Ccol, *Csize * sizeof(int));
        }
        // ---OR---
        // add the previous row's items so they exist in the next block
        for (int jj = Drow[i]; jj < Drow[i + 1]; jj++) {
            int j = Dcol[jj];
            xb[j] = true;
            (*Ccol)[nnzcum] = j;
            nnzcum++;
        }
        // --------
        // add new row items
        for (int jj = Arow[i]; jj < Arow[i + 1]; jj++) {
            int j = Acol[jj];
            for (int kp = Brow[j]; kp < Brow[j + 1]; kp++) {
                int k = Bcol[kp];
                if (!xb[k]) {
                    xb[k] = true;
                    (*Ccol)[nnzcum] = k;
                    nnzcum++;
                }
            }
        }
        if (nnzcum > nnzpv) {
            quickSort(*Ccol, nnzpv, nnzcum - 1);
            for (int p = nnzpv; p < nnzcum; p++) {
                xb[(*Ccol)[p]] = false;
            }
        }
    }
    Crow[An] = nnzcum;
    free(xb);
    return Crow[An];
}
The code inside the ---OR--- section only runs in the block version, in order to merge the previous block into the one currently being calculated; it basically does C = D + (A*B). I've also tried calculating the next block and then merging the two sorted arrays of each row of the two CSR matrices, which seems to be slower. Also, all matrices are in CSR format.

Matrix-Multiplication: Why non-blocked outperforms blocked?

I'm trying to speed up a matrix multiplication algorithm by blocking the loops to improve cache performance, yet the non-blocked version remains significantly faster regardless of matrix size, block size (I've tried lots of values between 2 and 200, powers of 2 and others) and optimization level.
Non-blocked version:
for(size_t i = 0; i < n; ++i)
{
    for(size_t k = 0; k < n; ++k)
    {
        int r = a[i][k];
        for(size_t j = 0; j < n; ++j)
        {
            c[i][j] += r * b[k][j];
        }
    }
}
Blocked version:
for(size_t kk = 0; kk < n; kk += BLOCK)
{
    for(size_t jj = 0; jj < n; jj += BLOCK)
    {
        for(size_t i = 0; i < n; ++i)
        {
            for(size_t k = kk; k < kk + BLOCK; ++k)
            {
                int r = a[i][k];
                for(size_t j = jj; j < jj + BLOCK; ++j)
                {
                    c[i][j] += r * b[k][j];
                }
            }
        }
    }
}
I also have a bijk version and a 6-loop bikj version, but they all get outperformed by the non-blocked version, and I don't get why this happens. Every paper and tutorial that I've come across seems to indicate that the blocked version should be significantly faster. I'm running this on a Core i5, if that matters.
Try blocking in one dimension only, not in both dimensions.
Matrix multiplication exhaustively processes elements from both matrices. Each row vector on the left matrix is repeatedly processed, taken into successive columns of the right matrix.
If the matrices do not both fit into the cache, some data will invariably end up loaded multiple times.
What we can do is break up the operation so that we work with about a cache-sized amount of data at one time. We want the row vector from the left operand to be cached, since it is repeatedly applied against multiple columns. But we should only take enough columns (at a time) to stay within the limit of the cache. For instance, if we can only take 25% of the columns, it means we will have to pass over the row vectors four times. We end up loading the left matrix from memory four times, and the right matrix only once.
(If anything is to be loaded more than once, it should be the row vectors on the left, because they are flat in memory, which benefits from burst loading. Many cache architectures can perform a burst load from memory into adjacent cache lines faster than random-access loads. If the right matrix were stored in column-major order, that would be even better: then we would be taking dot products between flat arrays, which prefetch nicely.)
Let's also not forget the output matrix: it occupies space in the cache too.
I suspect one flaw in the 2D blocked approach is that each element of the output matrix depends on two inputs: its entire row in the left matrix, and the entire column in the right matrix. If the matrices are visited in blocks, each target element is visited multiple times to accumulate partial results.
If we do a complete row-column dot product, we don't have to visit c[i][j] more than once; once we take column j into row i, we are done with that c[i][j].
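A minimal sketch of blocking in the j dimension only, reusing the question's arrays and BLOCK (assuming n is a multiple of BLOCK, and that an n-by-BLOCK slab of b fits in cache):
// Only the columns are blocked: each jj pass streams all of a once,
// while the BLOCK-wide slabs of b and c stay hot in cache. In total,
// a is re-read n/BLOCK times, but each slab of b comes from memory once.
for(size_t jj = 0; jj < n; jj += BLOCK)
{
    for(size_t i = 0; i < n; ++i)
    {
        for(size_t k = 0; k < n; ++k)
        {
            int r = a[i][k];
            for(size_t j = jj; j < jj + BLOCK; ++j)
            {
                c[i][j] += r * b[k][j];
            }
        }
    }
}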

How to use AVX/SIMD with nested loops and += format?

I am writing a PageRank program, and I am writing a method for updating the rankings. I have successfully got it working with nested for loops and also a threaded version. However, I would like to instead use SIMD/AVX.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code, P[] is of size npages and matrix_cap[] is of size npages * npages. P[] holds the ranks of the pages, and temp[] is used to store the next iteration's page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX, nor how I would get my data, which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row-major order), into a format which could be used with SIMD/AVX operations.
As far as AVX goes, this is what I have so far, though it's very, very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double *res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < g_mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_hat + i + j);
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(res + j, pm);
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t i = g_mod; i < npages; i++) {
        for (size_t j = 0; j < npages; j++) {
            sum += P[j] * matrix_cap[IDX(i,j)];
        }
    }
    temp[i] = sum;
    sum = 0.0;
}
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
Consider using OpenMP 4.x's #pragma omp simd reduction for the innermost loop. Bear in mind that omp reductions are not applicable to C++ arrays, so you have to use a temporary reduction variable, as shown below.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0; // was: temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP 4.x is currently supported by fresh GCC (4.9+) and Intel compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" meaning vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely here, due to the array element acting as a reduction variable). However, strictly speaking it is incorrect to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side-effect) disambiguate the pointers (in case the arrays are accessed via pointers).
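If the arrays are reached through pointers, plain C's restrict is one way to do that disambiguation up front. A sketch combining it with the pragma above (update_ranks is an illustrative name; compile with OpenMP enabled):
#include <stddef.h>

// restrict promises the compiler that temp, P and matrix_cap never alias,
// so the reduction is the only dependency left for the simd pragma to resolve.
void update_ranks(double *restrict temp, const double *restrict P,
                  const double *restrict matrix_cap, size_t npages) {
    for (size_t i = 0; i < npages; i++) {
        double tmp_reduction = 0.0;
        #pragma omp simd reduction(+:tmp_reduction)
        for (size_t j = 0; j < npages; j++) {
            tmp_reduction += P[j] * matrix_cap[i * npages + j];
        }
        temp[i] = tmp_reduction;
    }
}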
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P + j); // reused 8 times after loading
        // temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with an hsum + transpose and do vector stores,
    // taking advantage of the power of vhaddpd to do 4 useful hsums
    // with each instruction
    temp[i+0] = hsum_pd256(s0); // see the horizontal-sum link earlier for how to write this
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple of scalar
    // elements to the results of the hsum_pd256()s
}
// TODO: cleanup for the last up-to-7 odd elements
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
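For reference, one common way to write that hsum_pd256 helper (a sketch, not part of the original answer): fold 256 bits down to 128 with one add, then combine the last two doubles.
#include <immintrin.h>

// Horizontal sum of a __m256d.
static inline double hsum_pd256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);   // low half (free, no instruction)
    __m128d hi = _mm256_extractf128_pd(v, 1); // high half
    lo = _mm_add_pd(lo, hi);                  // two partial sums
    __m128d hi64 = _mm_unpackhi_pd(lo, lo);   // upper double moved to bottom
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi64));
}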
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor-of-2 smaller memory / cache footprint might give more speedup than that, if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values each for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (still only take a load-port uop), and plenty cheap for this on SnB/IvB, where they also take a shuffle-port uop.
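A sketch of that striped layout's inner loop (assuming npages is padded to a multiple of 4, an FMA-capable target, and an illustrative matrix_striped buffer in which element (i, j) lives at index (i/4 * npages + j) * 4 + i%4):
// Each stripe packs 4 rows: one 32B load yields column j for rows
// 4s..4s+3, so each lane of acc accumulates a different temp[i],
// and no horizontal sum is needed at the end.
for (size_t s = 0; s < npages / 4; s++) {
    const double *stripe = &matrix_striped[s * npages * 4];
    __m256d acc = _mm256_setzero_pd();
    for (size_t j = 0; j < npages; j++) {
        __m256d pj = _mm256_broadcast_sd(&P[j]); // one P element feeds 4 rows
        acc = _mm256_fmadd_pd(pj, _mm256_loadu_pd(&stripe[j * 4]), acc);
    }
    _mm256_storeu_pd(&temp[s * 4], acc); // 4 results at once, no hsum
}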

Determining stride of arrays

I am trying to perform LoopDependenceAnalysis on arrays in LLVM. For this I have written an LLVM LoopPass. I am able to detect the arrays using GetElementPtr, but I am unable to determine the stride of an array used in a loop.
For example, take this C code:
int b[10];
for(int i = 0; i < 10; i++)
{
    b[i] = b[i+2];
}
Now, of the two array accesses, the first one (b[i]) has a stride of 0 while the second one has a stride of 2. How is it possible to determine those values?

alternative to multidimensional array in c

I have the following code:
#define FIRST_COUNT 100
#define X_COUNT 250
#define Y_COUNT 310
#define Z_COUNT 40

struct s_tsp {
    short abc[FIRST_COUNT][X_COUNT][Y_COUNT][Z_COUNT];
};

struct s_tsp xyz;
I need to run through the data like this:
for (int i = 0; i < FIRST_COUNT; ++i)
    for (int j = 0; j < X_COUNT; ++j)
        for (int k = 0; k < Y_COUNT; ++k)
            for (int n = 0; n < Z_COUNT; ++n)
                doSomething(xyz, i, j, k, n);
I've tried to think of a more elegant, less brain-dead approach. (I know that this sort of multidimensional array is inefficient in terms of CPU usage, but that is irrelevant in this case.) Is there a better approach to the way I've structured things here?
If you need a 4D array, then that's what you need. It's possible to 'flatten' it into a single-dimensional malloc()ed 'array', however that is not quite as clean:
short *abc = malloc(sizeof(short) * FIRST_COUNT * X_COUNT * Y_COUNT * Z_COUNT);
Accesses are also more difficult:
*(abc + i*(X_COUNT*Y_COUNT*Z_COUNT) + j*(Y_COUNT*Z_COUNT) + k*Z_COUNT + n)
So that's obviously a bit of a pain.
But you do have the advantage that if you need to simply iterate over every single element, you can do:
for (int i = 0; i < FIRST_COUNT*X_COUNT*Y_COUNT*Z_COUNT; i++) {
    doWhateverWith(*(abc + i));
}
Clearly this method is terribly ugly for most uses, and is a bit neater for one type of access. It's also a bit more memory-conservative and only requires one pointer-dereference rather than 4.
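A helper macro can hide that arithmetic (a sketch using the corrected row-major formula above; IDX4 is an illustrative name):
// Row-major offset of abc[i][j][k][n] in the flattened block.
#define IDX4(i, j, k, n) \
    ((((size_t)(i) * X_COUNT + (j)) * Y_COUNT + (k)) * Z_COUNT + (n))

short *abc = malloc(sizeof(short) * FIRST_COUNT * X_COUNT * Y_COUNT * Z_COUNT);
abc[IDX4(i, j, k, n)] = 42; // instead of spelling out the arithmetic each time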
NOTE: The intention of the examples in this post is just to explain the concepts, so they may be incomplete, lack error handling, etc.
When it comes to using multi-dimensional arrays in C, the following are the two possible ways.
Flattening of Arrays
In C, arrays are implemented as a contiguous memory block. This information can be used to manipulate the values stored in the array and allows rapid access to a particular array location.
For example,
int arr[10][10];
int *ptr = (int *)arr;

ptr[11] = 10;
// this is equivalent to arr[1][1] = 10: treat the 2D array
// as a single-dimensional array and manipulate it directly.
The technique of exploiting the contiguous nature of arrays is known as flattening of arrays.
Ragged Arrays
Now, consider the following example.
char *list[3]; // pointers to strings of different lengths

list[0] = "United States of America";
list[1] = "India";
list[2] = "United Kingdom";

for (int i = 0; i < 3; i++)
    printf(" %zu ", strlen(list[i]));
// prints 24 5 14
This type of implementation is known as a ragged array, and it is useful where strings of variable size are needed. A popular method is to allocate memory dynamically for each dimension.
NOTE: Command-line arguments (char *argv[]) are passed as a ragged array.
Comparing flattened and ragged arrays
Now, let's consider the following code snippet, which compares flattened and ragged arrays.
/* Note: lacks error handling */
int flattened[30][20][10];
int ***ragged;
int i, j, numElements = 0, numPointers = 1;

ragged = (int ***)malloc(sizeof(int **) * 30);
numPointers += 30;
for (i = 0; i < 30; i++) {
    ragged[i] = (int **)malloc(sizeof(int *) * 20);
    numPointers += 20;
    for (j = 0; j < 20; j++) {
        ragged[i][j] = (int *)malloc(sizeof(int) * 10);
        numElements += 10;
    }
}
printf("Number of elements = %d\n", numElements);
printf("Number of pointers = %d\n", numPointers);
// it prints
// Number of elements = 6000
// Number of pointers = 631
From the above example, the ragged array requires 631 pointers; in other words, 631 * sizeof(int *) extra memory locations just to address 6000 integers. The flattened array, by contrast, requires only one base pointer: the name of the array is enough to address the 6000 contiguous memory locations.
But OTOH, ragged arrays are flexible. When the exact amount of memory required is not known at compile time, you don't have the luxury of allocating for the worst possible case; sometimes the exact amount needed is known only at run-time. In such situations, ragged arrays come in handy.
Row-major and column-major order of arrays
C follows row-major ordering for multi-dimensional arrays, and flattening of arrays can be viewed as a consequence of this. The significance of row-major order is that it fits the natural way in which most accesses are written in a program. For example, let's look at a traversal of an N * M 2D matrix:
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++)
        printf("%d ", matrix[i][j]);
    printf("\n");
}
Each row in the matrix is accessed one by one, with the column index varying rapidly; the C array is arranged in memory in this natural order. By contrast, consider the following example:
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++)
        printf("%d ", matrix[j][i]);
    printf("\n");
}
This varies the row index most frequently instead of the column index, so each access jumps a whole row away in memory. Because of this there is a big difference in efficiency between the two snippets: yes, the first one is faster!
The first one accesses the array in C's natural (row-major) order, whereas the second one has to jump around between rows. The performance gap widens as the number of dimensions and the size of the elements increase.
So when working with multi-dimensional arrays in C, it's good to keep the above details in mind!
