Efficient gather (of whole rows) from a large matrix in C

I am trying to perform a simple operation. I have a matrix of size A x B. I have a list of indices of length C, and I want to build a C x B matrix by collecting rows from the first matrix according to those indices, i.e. index i tells me which row of the first matrix goes into row i of the second matrix.
I presorted the indices so the algorithm is input-stationary: I load a row from the A x B matrix once and write it to every row of the output matrix that references it.
The code looks something like this (note that in the code below, C is used as the row length in floats):
for (int i = 0; i < A; i++)
{
    for (int k = offsets[i]; k < offsets[i + 1]; k++)
    {
        int dest = index1[k];
        for (int j = 0; j < C / 8; j++)
        {
            __m256 a = _mm256_load_ps(&input[i * C + j * 8]);
            _mm256_store_ps(&output[dest * C + j * 8], a);
        }
    }
}
The code is entirely bottlenecked by the writes to memory.
This code is efficient when C is small, but it scales very poorly as C increases, which I surmise is due to cache behavior (it takes about 10x as long when C = 1024 as when C = 256).
I tried blocking in the C dimension:
for (int c = 0; c < C; c += K)
{
    for (int i = 0; i < A; i++)
    {
        for (int k = offsets[i]; k < offsets[i + 1]; k++)
        {
            int dest = index1[k];
            for (int j = 0; j < C / 8 / K; j++)
            {
                __m256 a = _mm256_load_ps(&input[i * C + c + j * 8]);
                _mm256_store_ps(&output[dest * C + c + j * 8], a);
            }
        }
    }
}
This actually slows down the code more.
Any suggestions?

The inner loop is just a streamed copy operation, so the cache shouldn't matter much there. Try a plain memcpy() instead, so the compiler (and the C library) can hopefully generate better code for the copy:
//for (int j = 0; j < C / 8; j++)
//{
//    __m256 a = _mm256_load_ps(&input[i * C + j * 8]);
//    _mm256_store_ps(&output[dest * C + j * 8], a);
//}
memcpy(&output[dest * C], &input[i * C], C * sizeof(float));
Appendix
If that still does not give satisfactory results, as a last resort you could switch to C++ and replace the outer loop with parallel_for(). That may let the cache (or the memory pipeline) be used a little more effectively.
parallel_for(0, A, [&](const int i) {
    for (int k = offsets[i]; k < offsets[i + 1]; k++)
    {
        int dest = index1[k];
        memcpy(&output[dest * C], &input[i * C], C * sizeof(float));
    }
});
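If staying in plain C is preferred, the same idea can be sketched with OpenMP on the outer loop. This is only a sketch under the same assumptions as the code above (offsets, index1, input, output, A and C as in the question); parallel_for above presumably refers to a library such as PPL or TBB.
/* Hedged sketch: same gather with the outer loop parallelized via OpenMP
   (needs <string.h> for memcpy; compile with -fopenmp). Each output row is
   written by exactly one input row, so the iterations do not race. */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < A; i++) {
    for (int k = offsets[i]; k < offsets[i + 1]; k++) {
        int dest = index1[k];
        memcpy(&output[dest * C], &input[i * C], C * sizeof(float));
    }
}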

Related

How do I multiply two dynamic matrices in C?

I'm trying to multiply two dynamic matrices by passing them to a function, and I'm getting a segmentation fault during the multiplication.
The arguments passed in are correct, because I had to use them for a different operation in this project. I have a feeling I messed up the pointers, but I'm pretty new to C and I'm not sure where.
double** multiplyMatrices(
    double** a,
    const uint32_t a_rows,
    const uint32_t a_cols,
    double** b,
    const uint32_t b_cols){

    uint32_t i = 0;
    uint32_t j = 0;
    uint32_t k = 0;
    double** c;

    //allocate memory to matrix c
    c = (double **)malloc(sizeof(double *) * a_rows);
    for (i = 0; i < a_rows; i++) {
        *(c + i) = (double *)malloc(sizeof(double) * b_cols);
    }

    //clear matrix c
    for (i = 0; i < a_rows; i++) {
        for (j = 0; j < a_cols; j++) {
            *c[j] = 0;
        }
    }

    i = 0;
    //multiplication
    while (j = 0, i < a_rows) {
        while (k = 0, j < b_cols) {
            while (k < a_cols) {
                //following line is where i'm getting the segmentation fault
                *(*(c + (i * b_cols)) + j) += (*(*(a + (i * a_cols)) + k)) * (*(*(b + (k * b_cols)) + j));
                k++;
            }
            j++;
        }
        i++;
    }
    return c;
}
The obvious mistake is that you dereference c + i * b_cols, while c is an array of pointers of size a_rows. So c + i * b_cols likely points outside of the area that you previously allocated with malloc().
I would suggest simplifying the matrix representation by using a single array of double whose size is the total number of elements, i.e. rows * cols.
For example:
double *c;
c = malloc(sizeof(double) * a_rows * b_cols);
This not only performs better overall, but also simplifies the code. You then have to "linearise" the offset into your one-dimensional array by converting from the two-dimensional matrix coordinates. For example:
c[i * b_cols + j] = ...
Of course, the other two matrices need to be allocated, filled and accessed in a similar manner.
For code clarity, I would also replace the while statements by for statements with the actual variable that they loop on. For example:
for (i = 0; i < a_rows; i++)
for (j = 0; j < b_cols; j++)
for (k = 0; k < a_cols; k++)
You can (ab)use the C language in many ways, but the trick is to keep the code clear to yourself in the first place.
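Putting those pieces together, a minimal sketch of the function using flat arrays might look like the following. Note this assumes the caller also stores a and b as single row-major arrays, which differs from the original double** interface.
#include <stdint.h>
#include <stdlib.h>

/* Sketch: c = a * b with all matrices stored as flat row-major arrays.
   a is a_rows x a_cols, b is a_cols x b_cols, the result is a_rows x b_cols. */
double *multiplyMatrices(const double *a, uint32_t a_rows, uint32_t a_cols,
                         const double *b, uint32_t b_cols)
{
    double *c = malloc(sizeof(double) * a_rows * b_cols);
    if (c == NULL)
        return NULL;

    for (uint32_t i = 0; i < a_rows; i++) {
        for (uint32_t j = 0; j < b_cols; j++) {
            double sum = 0.0;
            for (uint32_t k = 0; k < a_cols; k++)
                sum += a[i * a_cols + k] * b[k * b_cols + j];
            c[i * b_cols + j] = sum;
        }
    }
    return c;
}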

Cumulative sum in C is blowing up to infinity

I'm generally an R user, but I am trying to use C for some lower-level cumulative sums and multiplications.
I am trying to generate a cumulative sum of eta, storing the result in tmp0. However, when I output tmp0 it gives me Inf, NaN, or some arbitrarily large number. I double-checked the same cumulative sum in R and it works fine; I am not sure why the C version does not. Below is the code I am using:
int i, j;
const int p = ncov, n = nin;
double accNum0[n]; //accumulate first part of likelihood sum eta_i
double accNum1[n]; //accumulate the backwards numerator
double accNum2[n]; //accumulate the forward numerator (weighted)
double tmp0 = 0;
double eta[n]; //calculate linear predictor in this step (X %*% beta)

for (i = 0; i < n; i++) {
    for (j = 0; j < p; j++)
        eta[i] += b[j] * x[n * j + i];
}

for (i = 0; i < n; ++i) {
    tmp0 += eta[i];
}

return (tmp0);
Again, I am fairly new to C so I may be making some rookie mistakes and would greatly appreciate any (and all) suggestions!
There might be errors with how you are initializing b or x. However, one definite error is that eta is being used uninitialized. This means eta[i] may begin with some arbitrary value instead of 0 as you are likely expecting.
Add an initialization before accumulating into it.
for (i = 0; i < n; i++) {
    eta[i] = 0;
    for (j = 0; j < p; j++)
        eta[i] += b[j] * x[n * j + i];
}
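Alternatively, since tmp0 is just the sum of all the eta[i], a sketch that skips the intermediate array entirely (assuming the same b, x, n and p as in the question) would be:
double tmp0 = 0.0;
for (i = 0; i < n; i++) {
    double eta_i = 0.0;               /* linear predictor for observation i */
    for (j = 0; j < p; j++)
        eta_i += b[j] * x[n * j + i];
    tmp0 += eta_i;
}
return tmp0;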

How to allocate triangular array using single malloc in C?

I am trying to allocate a triangular array using a single malloc, but I couldn't find any solution for this. My structure is something like this:
a - - - -
b c - - -
d e f - -
g h i j -
k l m n o
I've managed to do it using two mallocs.
How are you planning to use the structure — what code would you write to access an array element? Also, what size of array are you dealing with?
If the array is small enough (say less than 100x100, but the boundary value is negotiable) then it makes sense to use a regular rectangular array and access that as usual, accepting that some of the allocated space is unused. If the array will be large enough that the unused space will be problematic, then you have to work harder.
Do you plan to use lt_matrix[r][c] notation, or could you use a 1D array lt_matrix[x] where x is calculated from r and c? If you can use the 1D notation, then you can use a single allocation — as shown in Technique 1 in the code below. If you use the double-subscript notation, you should probably do two memory allocations — as shown in Technique 2 in the code below. If you don't mind living dangerously, you can mix things up with Technique 3, but it isn't recommended that you use it unless you can determine what the limitations and issues are and assess for yourself whether it is safe enough for you to use. (If you ask me, the answer's "No; don't use it", but that could be regarded as being over-abundantly cautious.)
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static inline int lt_index(int r, int c) { assert(r >= c); return r * (r + 1) / 2 + c; }

int main(void)
{
    int matrixsize = 5;

    /* Technique 1 */
    char *lt_matrix1 = malloc(matrixsize * (matrixsize + 1) / 2 * sizeof(*lt_matrix1));
    assert(lt_matrix1 != 0);    // Appalling error checking
    char value = 'a';
    for (int i = 0; i < matrixsize; i++)
    {
        for (int j = 0; j <= i; j++)
            lt_matrix1[lt_index(i, j)] = value++;
    }
    for (int i = 0; i < matrixsize; i++)
    {
        int j;
        for (j = 0; j <= i; j++)
            printf("%-3c", lt_matrix1[lt_index(i, j)]);
        for (; j < matrixsize; j++)
            printf("%-3c", '-');
        putchar('\n');
    }
    free(lt_matrix1);

    /* Technique 2 */
    char **lt_matrix2 = malloc(matrixsize * sizeof(*lt_matrix2));
    assert(lt_matrix2 != 0);    // Appalling error checking
    char *lt_data2 = malloc(matrixsize * (matrixsize + 1) / 2 * sizeof(*lt_matrix1));
    assert(lt_data2 != 0);      // Appalling error checking
    for (int i = 0; i < matrixsize; i++)
        lt_matrix2[i] = &lt_data2[lt_index(i, 0)];
    value = 'A';
    for (int i = 0; i < matrixsize; i++)
    {
        for (int j = 0; j <= i; j++)
            lt_matrix2[i][j] = value++;
    }
    for (int i = 0; i < matrixsize; i++)
    {
        int j;
        for (j = 0; j <= i; j++)
            printf("%-3c", lt_matrix2[i][j]);
        for (; j < matrixsize; j++)
            printf("%-3c", '-');
        putchar('\n');
    }
    free(lt_data2);
    free(lt_matrix2);

    /* Technique 3 - do not use this */
    void *lt_data3 = malloc(matrixsize * sizeof(int *) + matrixsize * (matrixsize + 1) / 2 * sizeof(int));
    assert(lt_data3 != 0);      // Appalling error checking
    int **lt_matrix3 = lt_data3;
    int *lt_base3 = (int *)((char *)lt_data3 + matrixsize * sizeof(int *));
    for (int i = 0; i < matrixsize; i++)
        lt_matrix3[i] = &lt_base3[lt_index(i, 0)];
    value = 1;
    for (int i = 0; i < matrixsize; i++)
    {
        for (int j = 0; j <= i; j++)
            lt_matrix3[i][j] = value++;
    }
    for (int i = 0; i < matrixsize; i++)
    {
        int j;
        for (j = 0; j <= i; j++)
            printf("%-3d", lt_matrix3[i][j]);
        for (; j < matrixsize; j++)
            printf("%-3c", '-');
        putchar('\n');
    }
    free(lt_data3);
    return 0;
}
The output from the program is:
a - - - -
b c - - -
d e f - -
g h i j -
k l m n o
A - - - -
B C - - -
D E F - -
G H I J -
K L M N O
1 - - - -
2 3 - - -
4 5 6 - -
7 8 9 10 -
11 12 13 14 15
Valgrind version 3.13.0.SVN (revision 16398) gives this a clean bill of health on macOS Sierra 10.12.5 using GCC 7.1.0.
You can just malloc(width * height * sizeof(Object)) if you want to use one malloc and create one contiguous array. If you want to access the (x, y) position, use array[y * width + x].
Using two mallocs just creates an array of pointers, which is a little different from a contiguous array acting like a 2D array.
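Applied to the 5x5 example from the question, that might look like this sketch; the unused upper triangle is simply left untouched.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int size = 5;
    char *m = malloc((size_t)size * size * sizeof *m);   /* one allocation */
    if (m == NULL)
        return 1;

    char value = 'a';
    for (int r = 0; r < size; r++)
        for (int c = 0; c <= r; c++)
            m[r * size + c] = value++;                    /* m[row * width + col] */

    printf("m[4][2] = %c\n", m[4 * size + 2]);            /* prints 'm' */
    free(m);
    return 0;
}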

C - stack smashing detected

I need to implement a fairly simple in-place LU decomposition of a matrix A. I'm using Gaussian elimination and I want to test it with a 3x3 matrix. The problem is, I keep getting a "stack smashing detected" error and I have no idea why. I don't see anything in my code that could cause this. Do you have any idea?
The problem is probably in the Factorization block.
My code:
#include <stdio.h>

int main() {
    int n = 3; // matrix size
    int A[3][3] = {
        {1, 4, 7},
        {2, 5, 8},
        {3, 6, 10}
    };

    printf("Matrix A:\n");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            printf("%d ", A[i][j]);
            if (j % 2 == 0 && j != 0) {
                printf("\n");
            }
        }
    }

    // FACTORIZATION
    int k;
    int rows;
    for (k = 0; k < n; k++) {
        rows = k + k+1;
        A[rows][k] = A[rows][k] / A[k][k];
        A[rows][rows] = A[rows][rows] - A[rows][k] * A[k][rows];
        printf("k: %d\n", k);
    }

    printf("Matrix after decomp:\n");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            printf("%d ", A[i][j]);
            if (j % 3 == 0 && j != 0) {
                printf("\n");
            }
        }
    }
    return 0;
}
Your error is most likely here:
rows = k + k+1;
A[rows][k] = A[rows][k]/A[k][k];
A[rows][rows] = A[rows][rows] - A[rows][k] * A[k][rows];
This means that rows goes through the values 1, 3, 5; and is then used to access an array with only three elements. That would, indeed, overflow, as the only valid offset among those is 1.
EDIT: Looking at your Matlab code, it is doing something completely different: rows = k + 1:n sets rows to a small vector, which it then uses to slice the matrix, something C does not support as a primitive. You would need to reimplement both that and the matrix multiplication A(rows, k) * A(k, rows) using explicit loops.
Your original Matlab code was (Matlab has 1-based indexing):
for k = 1:n - 1
    rows = k + 1:n
    A(rows, k) = A(rows, k) / A(k, k)
    A(rows, rows) = A(rows, rows) - A(rows, k) * A(k, rows)
end
What rows = k + 1:n does is set rows to represent a range. The expression A(rows, k) is actually a reference to a vector-shaped slice of the matrix, and Matlab can divide a vector by a scalar.
On the last line, A(rows, rows) is a matrix-shaped slice, and A(rows, k) * A(k, rows) is a matrix multiplication, e.g. multiplying matrices of dimensions (3,1) and (1,3) to get one of dimension (3,3).
In C you can't do that using the builtin = and / operators.
The C equivalent is:
for ( int k = 0; k < n - 1; ++k )
{
    // A(rows, k) = A(rows, k) / A(k, k)
    for ( int row = k + 1; row < n; ++row )
        A[row][k] /= A[k][k];

    // A(rows, rows) = A(rows, rows) - A(rows, k) * A(k, rows)
    for ( int row = k + 1; row < n; ++row )
        for ( int col = k + 1; col < n; ++col )
            A[row][col] -= A[row][k] * A[k][col];
}
(disclaimer: untested!)
The first part is straightforward: every value in a vector is divided by a scalar.
The second line is more complicated: the Matlab code includes a matrix multiplication, a matrix subtraction, and the operation of extracting a sub-matrix from a matrix. A direct, literal translation of that into C would be very complicated, so instead we use two nested loops over the rows and columns to perform the operation on the square sub-matrix.
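For completeness, here is a minimal self-contained sketch of the whole factorization. It switches A to double (my assumption; for this particular matrix the factors happen to be integers, but in general they are not):
#include <stdio.h>

int main(void)
{
    enum { n = 3 };
    double A[n][n] = {
        {1, 4, 7},
        {2, 5, 8},
        {3, 6, 10}
    };

    /* In-place LU factorization by Gaussian elimination (no pivoting). */
    for (int k = 0; k < n - 1; k++) {
        for (int row = k + 1; row < n; row++)
            A[row][k] /= A[k][k];
        for (int row = k + 1; row < n; row++)
            for (int col = k + 1; col < n; col++)
                A[row][col] -= A[row][k] * A[k][col];
    }

    printf("Matrix after decomposition (L below the diagonal, U on and above):\n");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            printf("%8.4f ", A[i][j]);
        putchar('\n');
    }
    return 0;
}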

Multiplying a large square matrix via its transpose is slower than plain square matrix multiplication... How to fix?

Apparently, transposing one matrix and then multiplying should be faster than just multiplying the two matrices directly. However, my code right now is not faster and I have no clue why. The normal multiplication is just the triple-nested for loop, and it takes roughly 1.12 s to multiply a 1000x1000 matrix, whilst this code takes about 8 times as long (so slower instead of faster). I am lost now; any help would be appreciated! :D
A = malloc(size * size * sizeof(double));
B = malloc(size * size * sizeof(double));
C = malloc(size * size * sizeof(double));

/* initialise array elements */
for (row = 0; row < size; row++) {
    for (col = 0; col < size; col++) {
        A[size * row + col] = rand();
        B[size * row + col] = rand();
    }
}

t1 = getTime();
/* code to be measured goes here */

T = malloc(size * size * sizeof(double));
for (i = 0; i < size; ++i) {
    for (j = 0; j <= i; ++j) {
        T[size * i + j] = B[size * j + i];
    }
}

for (j = 0; j < size; ++j) {
    for (k = 0; k < size; ++k) {
        for (m = 0; m < size; ++m) {
            C[size * j + k] = A[size * j + k] * T[size * m + k];
        }
    }
}
t2 = getTime();
I see a couple of problems.
First, you are just setting the value of C[size * j + k] instead of incrementing it. Even though this is an error in the computation, it shouldn't impact performance. Also, you need to initialize C[size * j + k] to 0.0 before the innermost loop starts; otherwise you will be incrementing an uninitialized value, which is a serious problem that could even result in overflow.
Second, the multiplication term is wrong.
Remember that your multiplication term needs to represent:
C[j, k] += A[j, m] * B[m, k], which is
C[j, k] += A[j, m] * T[k, m]
Instead of
C[size * j + k] = A[size * j + k] * T[size * m + k];
you need
C[size * j + k] += A[size * j + m] * T[size * k + m];
//              ^  ^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^
//              |  |                  Need to get T[k, m], not T[m, k]
//              |  Need to get A[j, m], not A[j, k]
//              Increment, not set.
I think the main culprit hurting performance, in addition to being wrong, is your use of T[size * m + k]. With that term there is a lot of jumping around in memory to reach the data (m is the fastest-changing variable in the loop). When you use the correct term, T[size * k + m], there will be much less of that and you should see a performance improvement.
In summary, use:
for (j = 0; j < size; ++j) {
    for (k = 0; k < size; ++k) {
        C[size * j + k] = 0.0;
        for (m = 0; m < size; ++m) {
            C[size * j + k] += A[size * j + m] * T[size * k + m];
        }
    }
}
You might be able to get a little bit more performance by using:
double* a = NULL;
double* c = NULL;
double* t = NULL;

for (j = 0; j < size; ++j) {
    a = A + (size * j);
    c = C + (size * j);
    for (k = 0; k < size; ++k) {
        t = T + size * k;
        c[k] = 0.0;
        for (m = 0; m < size; ++m) {
            c[k] += a[m] * t[m];
        }
    }
}
PS I haven't tested the code. Just giving you some ideas.
It is likely that your transpose runs slower than the multiplication in this test because the transpose is where the data is loaded from memory into cache, while the matrix multiplication runs out of cache, at least for 1000x1000 with many modern processors (24 MB fits into cache on many Intel Xeon processors).
In any case, both your transpose and multiplication are horribly inefficient. Your transpose is going to thrash the TLB, so you should use a blocking factor of 32 or so (see https://github.com/ParRes/Kernels/blob/master/SERIAL/Transpose/transpose.c for example code).
Furthermore, on x86 it is better to write contiguously (due to how cache-line locking and blocking stores work; if you use non-temporal stores carefully, this might change), whereas on some variants of PowerPC, in particular the Blue Gene variants, you want to read contiguously (because of in-order execution, non-blocking stores and the write-through cache). See https://github.com/jeffhammond/HPCInfo/blob/master/tuning/transpose/transpose.c for example code.
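As a rough illustration of the blocking suggested above, a tiled transpose might look like the sketch below. The tile size of 32 is only a starting point to tune, and note that it fills the entire transpose, since the multiplication afterwards reads all of T.
/* Sketch: blocked (tiled) transpose, T = transpose of B, both size x size,
   row-major. BLOCK is a tuning parameter; 32 is only a starting point. */
#define BLOCK 32

for (int ib = 0; ib < size; ib += BLOCK) {
    for (int jb = 0; jb < size; jb += BLOCK) {
        int imax = (ib + BLOCK < size) ? ib + BLOCK : size;
        int jmax = (jb + BLOCK < size) ? jb + BLOCK : size;
        for (int i = ib; i < imax; i++)
            for (int j = jb; j < jmax; j++)
                T[size * i + j] = B[size * j + i];   /* contiguous writes to T */
    }
}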
Finally, I don't care what you say ("I specifically have to do it this way though"), you need to use BLAS for the matrix multiplication. End of story. If your supervisor or some other coworker is telling you otherwise, they are incompetent and should not be allowed to talk about code until they have been thoroughly reeducated. Please refer them to this post if you don't feel like telling them yourself.
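For reference, with a CBLAS implementation (OpenBLAS, MKL, ATLAS, ...) the whole measured region collapses to a single call, and the explicit transpose is not even needed. A sketch, assuming row-major size x size matrices as in the question:
#include <cblas.h>

/* C = 1.0 * A * B + 0.0 * C, all matrices size x size, row-major. */
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            size, size, size,
            1.0, A, size,
            B, size,
            0.0, C, size);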
