How to save memory when solving a symmetric (or upper triangular) matrix? - arrays

I need to solve a system of linear algebraic equations A.X = B.
The matrix A is double precision, about 33000x33000 in size, and I get an error when I try to allocate it:
Cannot allocate array - overflow on array size calculation.
Since I am using LAPACK dposv with the Intel MKL library, I was wondering if there is a way to pass a smaller matrix to the library function (because only half of the matrix is needed for the solution)?
The dposv function only needs the upper or lower triangular part of A. Here are more details about dposv.
Update: Please notice that the A matrix is N x N and yet it takes lda: INTEGER as "The leading dimension of a; lda ≥ max(1, n)". So maybe there is a way to pass A as a 1D array?

As the error says (Cannot allocate array - overflow on array size calculation), your problem seems to be somewhere else: specifically, the limit of the integer type used to compute the array size internally. I am afraid you might not be able to solve that even by adding more memory. You will need to check how the library you are using manages memory internally (possibly MKL, but I don't use MKL so I cannot help), or choose another one.
Explanation: some functions use a 4-byte integer to compute the memory size when allocating. That gives a limit of 2^32 bytes, or 4 GB, which is way lower than your ~8.7 GB array. That assumes an unsigned integer; with a signed integer, the limit is 2 GB.
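For illustration, here is a tiny hypothetical demonstration of that failure mode, using unsigned arithmetic so the wraparound is well defined (the library's internal computation may differ):

#include <stdio.h>

int main(void)
{
    unsigned n = 33000u;
    /* 33000 * 33000 * 8 = 8,712,000,000 bytes, but 32-bit unsigned
     * arithmetic wraps modulo 2^32 and silently yields 122,065,408
     * (~122 MB) instead. */
    unsigned size32 = n * n * 8u;
    unsigned long long size64 = (unsigned long long)n * n * 8u;
    printf("32-bit size: %u\n64-bit size: %llu\n", size32, size64);
    return 0;
}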
Hints if you have limited memory:
If you do not have enough memory (about 4 GB for the matrix alone, since it is triangular) and you do not know the structure of the matrix, then forget about special solvers and solve your problem yourself. Solving a system with an upper triangular matrix is a backward substitution: starting with the last row of the solution, you need only one row of the matrix to compute each component of the solution.
Find a way to load your matrix row by row, starting with the last row, as in the sketch below.
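A minimal sketch of that idea, assuming the rows can be streamed in from disk; load_row is a hypothetical reader standing in for however you fetch row i, not a library function:

#include <stdlib.h>

/* Placeholder: read row i (length n) of the matrix from wherever it
 * is stored. NOT a real library function. */
extern void load_row(size_t i, size_t n, double *row);

/* Backward substitution for an upper triangular system U.x = b,
 * keeping only one matrix row in memory at a time. */
void back_substitute(size_t n, const double *b, double *x)
{
    double *row = malloc(n * sizeof *row);

    for (size_t i = n; i-- > 0; ) {          /* last row first */
        load_row(i, n, row);
        double s = b[i];
        for (size_t j = i + 1; j < n; j++)   /* components already solved */
            s -= row[j] * x[j];
        x[i] = s / row[i];                   /* diagonal must be non-zero */
    }
    free(row);
}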

Thanks to mecej4
There are several options to pass a huge matrix using less memory:
Using functions that support Matrix Storage Schemes, e.g. the band solver ?pbsv or the packed solver ?ppsv (see the sketch below)
Using PARDISO
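For example, a minimal sketch of the packed-storage route via LAPACKE's dppsv (with MKL the header is mkl_lapacke.h; the 3x3 data here is made up for illustration). The packed array holds only n(n+1)/2 doubles instead of n^2:

#include <stdio.h>
#include <lapacke.h>   /* with MKL, include mkl_lapacke.h instead */

int main(void)
{
    lapack_int n = 3, nrhs = 1;
    /* Upper triangle of a 3x3 SPD matrix in column-major packed
     * storage: ap[i + j*(j+1)/2] = A(i,j) for i <= j. */
    double ap[] = { 4.0,            /* A(0,0)               */
                    1.0, 3.0,       /* A(0,1) A(1,1)        */
                    0.0, 1.0, 2.0   /* A(0,2) A(1,2) A(2,2) */ };
    double b[] = { 1.0, 2.0, 3.0 }; /* right-hand side, overwritten by x */

    lapack_int info = LAPACKE_dppsv(LAPACK_COL_MAJOR, 'U', n, nrhs, ap, b, n);
    if (info != 0) { fprintf(stderr, "dppsv failed: %d\n", (int)info); return 1; }

    for (int i = 0; i < n; i++)
        printf("x[%d] = %g\n", i, b[i]);
    return 0;
}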

Related

C99 recursive matrix multiplication. How to access indices?

I'm trying to implement a version of the following algorithm:
But instead of reducing the problem to 1x1 multiplications, I want to stop the recursion at blocks of dimension 16x16 and multiply those blocks.
I already have an implementation for a block size of 1x1 (the same as the algorithm mentioned before), and I need to extend my version to bigger blocks (e.g. 16x16).
Multiplying the blocks isn't hard. My problem is accessing the indices of my blocks, since I have to use the recursive call of my 1x1-block version:
recursive_mult(n/2 , stride , &A[0+0*stride] , &B[0+0*stride] , &C[0+0*stride]) ;
...where n is the size of the current blocks, and I'm using a stride in the matrices for performance testing.
So I only pass references to my matrices.
Is there a way to work around this, without changing the recursive call to pass the start indices of the blocks?
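One common way to do this, sketched below under the assumptions that storage is row-major with stride as the distance between rows, n is a power of two, and C is zero-initialized: each quadrant is addressed purely by a pointer offset from the block's top-left element, so no absolute start indices ever need to be passed down.

#include <stddef.h>

#define BLOCK 16   /* base-case block size */

/* Base case: plain triple loop on a tile of at most BLOCK x BLOCK. */
static void block_mult(size_t n, size_t stride,
                       const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i*stride + j] += A[i*stride + k] * B[k*stride + j];
}

/* Recursive case: each n/2 x n/2 quadrant is reached by a pointer
 * offset, so the signature never changes. */
void recursive_mult(size_t n, size_t stride,
                    const double *A, const double *B, double *C)
{
    if (n <= BLOCK) { block_mult(n, stride, A, B, C); return; }

    size_t h = n / 2;
    const double *A11 = A,            *A12 = A + h,
                 *A21 = A + h*stride, *A22 = A + h*stride + h;
    const double *B11 = B,            *B12 = B + h,
                 *B21 = B + h*stride, *B22 = B + h*stride + h;
    double       *C11 = C,            *C12 = C + h,
                 *C21 = C + h*stride, *C22 = C + h*stride + h;

    recursive_mult(h, stride, A11, B11, C11);   /* C11 += A11*B11 + A12*B21 */
    recursive_mult(h, stride, A12, B21, C11);
    recursive_mult(h, stride, A11, B12, C12);   /* C12 += A11*B12 + A12*B22 */
    recursive_mult(h, stride, A12, B22, C12);
    recursive_mult(h, stride, A21, B11, C21);   /* C21 += A21*B11 + A22*B21 */
    recursive_mult(h, stride, A22, B21, C21);
    recursive_mult(h, stride, A21, B12, C22);   /* C22 += A21*B12 + A22*B22 */
    recursive_mult(h, stride, A22, B22, C22);
}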

How to find the kth smallest element of a list without sorting the list?

I need to find the median of an array without sorting or copying the array.
The array is stored in the shared memory of a CUDA program. Copying it to global memory would slow the program down, and there is not enough space in shared memory to make an additional copy of it there.
I could use two 'for' loops, iterating over every possible value and counting how many values are smaller than it, but this would be O(n^2). Not ideal.
Does anybody know of an O(n) or O(n log n) algorithm that solves my problem?
Thanks.
If your input consists of integers with absolute value smaller than C, there's a simple O(n log C) algorithm that needs only constant additional memory: just binary search for the answer, i.e. find the smallest number x that is larger than or equal to at least k elements in the array. It's easily parallelizable, too, via a parallel prefix scan to do the counting.
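A sequential sketch of that binary search (on the GPU, the counting loop would become a parallel reduction; for the median, call it with k = (n + 1) / 2):

#include <stddef.h>

/* Binary search on the value range: find the smallest x such that at
 * least k elements of a[] are <= x. O(n log C) time, O(1) extra
 * memory; values are assumed to lie in [-C, C]. */
long kth_smallest(const long *a, size_t n, size_t k, long C)
{
    long lo = -C, hi = C;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        size_t count = 0;
        for (size_t i = 0; i < n; i++)   /* counting pass over the array */
            if (a[i] <= mid) count++;
        if (count >= k) hi = mid;        /* mid works; try smaller */
        else            lo = mid + 1;    /* mid is too small */
    }
    return lo;
}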
Your time and especially memory constraints make this problem difficult. It becomes easy, however, if you're able to use an approximate median.
Say an element y is an ε-approximate median if
m/2 − εm < rank(y) < m/2 + εm,
where m is the size of the array and rank(y) is the position of y in sorted order.
Then all you need to do is sample
t = 7ε^(-2) log(2δ^(-1))
elements and find their median any way you want.
Note that the number of samples you need is independent of your array's size: it is just a function of the accuracy ε and the failure probability δ.
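A host-side sketch of that sampling scheme (the name approx_median is illustrative, and rand() is used for brevity; on the GPU you would draw the samples with something like cuRAND):

#include <math.h>
#include <stdlib.h>

static int cmp_double(const void *p, const void *q)
{
    double a = *(const double *)p, b = *(const double *)q;
    return (a > b) - (a < b);
}

/* Draw t = 7*eps^-2 * log(2/delta) elements with replacement and
 * return the exact median of the sample; the sample is tiny, so
 * sorting it is cheap. */
double approx_median(const double *a, size_t n, double eps, double delta)
{
    size_t t = (size_t)ceil(7.0 / (eps * eps) * log(2.0 / delta));
    if (t > n) t = n;                  /* cannot usefully sample more than n */

    double *s = malloc(t * sizeof *s);
    if (!s) return 0.0;
    for (size_t i = 0; i < t; i++)
        s[i] = a[rand() % n];          /* uniform sample with replacement */

    qsort(s, t, sizeof *s, cmp_double);
    double med = s[t / 2];
    free(s);
    return med;
}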

Eigen Sparse Matrix

I am trying to multiply two large sparse matrices, of sizes 300k x 1000k and 1000k x 300k, using Eigen. The matrices are highly sparse, with ~0.01% non-zero entries, but there is no block or other structure in their sparsity.
It turns out that Eigen chokes and ends up taking 55-60 GB of memory. In fact, it makes the final matrix dense, which explains why it takes so much memory.
I have tried multiplying matrices of similar sizes when one of them is diagonal, and that multiplication works fine, using ~2-3 GB of memory.
Any thoughts on what's going wrong?
Even though your matrices are sparse, the result might be completely dense. You can try to remove the smallest entries with (A*B).prune(ref,eps);, where ref is a reference value for what counts as non-zero and eps is a tolerance. Basically, all entries smaller than ref*eps will be removed during the computation of the product, reducing both the memory usage and the size of the result. An even better option would be to find a way to avoid performing this product in the first place.

very huge matrix in C programming

Good day everyone,
I'm new to C programming and I don't have much knowledge of how to handle very large matrices in C, e.g. a matrix of size 30,000 x 30,000.
My first approach is to allocate the memory dynamically:
int main()
{
    int **mat;
    int j;

    mat = (int **)malloc(R * sizeof(int *));
    for (j = 0; j < R; j++)
        mat[j] = (int *)malloc(P * sizeof(int));
}
This works fine for matrices of around 8,000 x 8,000, but not bigger. So, I want to ask for any light on how to handle this kind of huge matrix, please.
As I said before: I am new to C, so please don't expect too much experience.
Thanks in advance for any suggestion,
David Alejandro.
PD: My laptop configuration is Linux Ubuntu, 64-bit, i7, and 4 GB of RAM.
For a matrix as large as that, I would try to avoid all those calls to malloc. This reduces the time needed to set up the data structure and removes the per-allocation memory overhead (malloc stores additional bookkeeping information, such as the size of each chunk).
Just use malloc once, i.e.:
#include <stdlib.h>
int *matrix = malloc(R * P * sizeof(int));
Then compute the index as
index = column + row * P;
Also, access the memory sequentially, i.e. by varying the column index fastest. That gives better cache performance.
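Putting those pieces together, a minimal sketch (the size_t casts are there so the index and size arithmetic is not done in possibly 32-bit int):

#include <stdlib.h>

enum { R = 30000, P = 30000 };   /* ~3.35 GB of int: needs a 64-bit build */

int main(void)
{
    /* One allocation for the whole matrix. */
    int *mat = malloc((size_t)R * P * sizeof *mat);
    if (mat == NULL) return 1;

    for (size_t row = 0; row < R; row++)       /* row outermost...    */
        for (size_t col = 0; col < P; col++)   /* ...column innermost */
            mat[row * (size_t)P + col] = 0;    /* sequential, cache-friendly */

    free(mat);
    return 0;
}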
Well, a two-dimensional array (the rough C analogue of a matrix) of 30000 x 30000 ints, assuming 4 bytes per int, would occupy 3.6 * 10^9 bytes, or ~3.35 gigabytes. No conventional system is going to let you allocate that much static virtual memory at compile time, and I'm not certain you could successfully allocate it dynamically with malloc() either.
If you only need to represent a small numerical range, you could drastically reduce your program's memory consumption (i.e., by a factor of 4) by using char instead of int. If you need to do something like assign boolean values to specific numbers corresponding to the indices of the array, you could use bitsets and curtail your memory consumption further (by a factor of 32); see the sketch below. Otherwise, the only viable approach is to work with smaller subsets of the matrix, possibly saving intermediate results to disk if necessary.
If you could elaborate on how you intend to use these massive matrices, we might be able to offer some more specific advice.
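A sketch of the bitset idea, assuming the matrix only needs to hold boolean values (one bit per cell is ~112 MB instead of ~3.35 GB):

#include <stdlib.h>

#define N 30000ULL   /* 30,000 x 30,000 boolean matrix */

int main(void)
{
    unsigned long long nbits = N * N;
    unsigned char *bits = calloc((size_t)((nbits + 7) / 8), 1);  /* ~112 MB */
    if (!bits) return 1;

    unsigned long long i = 12345ULL * N + 678;        /* linear index of cell (12345, 678) */
    bits[i / 8] |= (unsigned char)(1u << (i % 8));    /* set the bit  */
    int v = (bits[i / 8] >> (i % 8)) & 1;             /* read it back */
    (void)v;

    free(bits);
    return 0;
}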
Assuming you declare your values as float rather than double, your array will be about 3.4 GB in size. As long as you only need one, and your Ubuntu system has virtual memory, I think you could just code this in the obvious way.
If you need multiple matrices this large, you might want to think about:
Putting a lot more RAM into your computer.
Renting time on a computing cluster, and using cluster-based processing to compute the values you need.
Rewriting your code to work on subsets of your data, writing each subset out to disk, and freeing the memory before reading in the next subset.
You might want to do a Google search for "processing large data sets".
I don't know how to add comments, so I'm dropping an answer here.
One thing I can think of: you are not going to compute those values inside the running program; they will only come from files. So instead of loading all the values at once, keep reading them a chunk (say, one row of 30,000 values) at a time, so the whole matrix never has to be in memory.
For a 30k x 30k matrix, if the initial value is 0 (or the same) for all elements, then instead of creating the whole matrix you can create a 60k x 3 table (the three columns being: row number, column number, and value). This works because at most ~60k different locations will be affected.
I know this is going to be a little slow, because you always need to check whether an element has already been added or not. So, if speed is not your concern, this will work; see the sketch below.
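A rough sketch of that scheme as a plain coordinate list (the names are illustrative, and the linear-scan lookup is exactly the "a little slow" part mentioned above):

#include <stdlib.h>

/* Each touched cell is one (row, col, value) triple; every cell that
 * never appears in the list implicitly holds 0. */
typedef struct { int row, col, value; } Entry;

typedef struct {
    Entry *e;
    size_t used, cap;
} SparseMat;

/* Linear scan over the stored triples. */
static Entry *find_entry(SparseMat *m, int row, int col)
{
    for (size_t i = 0; i < m->used; i++)
        if (m->e[i].row == row && m->e[i].col == col)
            return &m->e[i];
    return NULL;
}

void sparse_set(SparseMat *m, int row, int col, int value)
{
    Entry *hit = find_entry(m, row, col);
    if (hit) { hit->value = value; return; }
    if (m->used == m->cap) {                   /* grow the triple list */
        size_t cap = m->cap ? 2 * m->cap : 1024;
        Entry *p = realloc(m->e, cap * sizeof *p);
        if (!p) exit(1);
        m->e = p;
        m->cap = cap;
    }
    m->e[m->used++] = (Entry){ row, col, value };
}

int sparse_get(SparseMat *m, int row, int col)
{
    Entry *hit = find_entry(m, row, col);
    return hit ? hit->value : 0;               /* untouched cells read as 0 */
}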

Flexible array size in C

I am coding an MCMC algorithm in C and I have a little problem. The idea of this algorithm is to make inferences about the number of groups in a population. So let us say that we start with k groups, where the first value of k is given by the user or randomly selected. At each step of the algorithm, k can decrease by 1, increase by 1, or stay the same. And I have some variables for each group:
double *mu;
double *lambda;
double **A;
mu and lambda are arrays of k elements, and A is a two-dimensional array of size k x N; N also changes at each iteration. I have some data y1, y2, ..., yn, so at each iteration I do some processing, propose new values for the parameters, and decide whether or not to move k.
So far I have tried to use malloc and realloc to deal with all these changes in the dimensions of my parameters, but I have to iterate this algorithm, say, 100,000 times, so at a certain point it crashes. If I start with k=10, in my case it crashes at the third iteration!
So, two questions:
Can I use realloc at each iteration, or is this my big mistake? If realloc is fine, then I imagine I should check my code!
If not, what should I do? Any suggestions?
I would consider not changing your storage on every iteration. realloc carries considerable overhead (in the worst case, it has to copy your entire array every single time).
Can you simply allocate for the maximum dimensions at startup, and then just use less of it? Or, at the very least, only realloc on an increase in storage requirements, doubling your capacity each time (thus mimicking how a std::vector grows).
[By the way, I don't know why your application crashes, as you haven't given us any details (e.g. the error message you get, or what you've found by debugging), but I guess you have a bug somewhere!]
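A minimal sketch of that doubling strategy for one of the arrays (the names Vec and vec_resize are illustrative):

#include <stdlib.h>

/* Grow-only capacity management: reallocate only when the requested
 * size exceeds the current capacity, and double the capacity when you
 * do, so most iterations perform no allocation at all. */
typedef struct {
    double *data;
    size_t  size, cap;
} Vec;

int vec_resize(Vec *v, size_t new_size)
{
    if (new_size > v->cap) {
        size_t cap = v->cap ? v->cap : 16;
        while (cap < new_size)
            cap *= 2;                      /* double, like std::vector */
        double *p = realloc(v->data, cap * sizeof *p);
        if (!p) return -1;                 /* old buffer is still valid */
        v->data = p;
        v->cap = cap;
    }
    v->size = new_size;                    /* shrinking never reallocates */
    return 0;
}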
