Finding eigenvectors and eigenvalues of a sparse matrix with ARPACK (called from Python, MATLAB or as a Fortran subroutine) - sparse-matrix

A few days ago I asked how to find the eigenvalues of a large sparse matrix. I got no answers, so I decided to describe a potential solution myself.
One question remains:
Can I use the Python implementation of ARPACK
to compute the eigenvalues of an asymmetric sparse matrix?
At the beginning I would like to say that it is not at all necessary to call the ARPACK subroutines directly from a Fortran driver program. That is quite difficult and I never got it working. But one can do the following:
#
OPTION 1: Python
#
One can install numpy and scipy and run the following code:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

# coordinate (COO) format storage of the matrix
# rows
ii = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 4])
# cols.
jj = np.array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4])
# and the data
data = np.array([1., -1., -1., 2., -2., -2., 1., 1., 1., 1., 1.])
# now put this into sparse storage (CSR format)
m = csr_matrix((data, (ii, jj)), shape=(5, 5))
# you can check what you did
print(m.todense())
# matrix([[ 1., -1.,  0.,  0.,  0.],
#         [-1.,  2., -2.,  0.,  0.],
#         [ 0., -2.,  1.,  1.,  0.],
#         [ 0.,  0.,  1.,  1.,  0.],
#         [ 0.,  0.,  0.,  0.,  1.]])
# the real part starts here
evals_large, evecs_large = eigsh(m, 4, which='LM')
# print the largest 4 eigenvalues
print(evals_large)
# and the values are
# [-1.04948118  1.          1.48792836  3.90570354]
Well, this is all very nice, especially because it spares us the joy of reading the very "well written" ARPACK manual.
I have one problem with this: eigsh only handles symmetric (Hermitian) matrices, so it does not work for asymmetric ones. At least, comparing the results to MATLAB was not very convincing.
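For a non-symmetric matrix, SciPy exposes ARPACK's general (non-symmetric) driver as scipy.sparse.linalg.eigs. A minimal sketch, reusing the matrix m built above (even though this particular m happens to be symmetric):
from scipy.sparse.linalg import eigs

# general ARPACK driver: works for non-symmetric matrices;
# eigenvalues come back complex even when the imaginary parts are zero
evals, evecs = eigs(m, k=3, which='LM')  # k must be smaller than n-1
print(evals)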
#
OPTION 2: MATLAB
#
% put your data in a file "matrix.dat"
% row  col.  data
% note that indexing starts at "1"
1 1  1.
1 2 -1.
......
load matrix.dat
M = spconvert(matrix);
% eigs is MATLAB's ARPACK interface; by default it returns
% the 6 largest-magnitude eigenvalues and eigenvectors of M
[v, d] = eigs(M);
% v - contains the eigenvectors
% d - contains the eigenvalues on its diagonal
I think that using MATLAB is way simpler, and it works for asymmetric matrices. However, I have a 500000x500000 sparse matrix, so whether this will work in MATLAB is another cup of tea! I have to note that using Python I was able to load a matrix of this size and compute its eigenvalues without too much trouble.
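For the record, the same matrix.dat workflow can be reproduced in Python. This is a minimal sketch, assuming the file holds "row col data" triplets with 1-based indices as in the MATLAB example above:
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import eigs

# read "row col data" triplets; the file uses 1-based indices
rows, cols, vals = np.loadtxt('matrix.dat', unpack=True)
M = coo_matrix((vals, (rows.astype(int) - 1, cols.astype(int) - 1)))

# a few largest-magnitude eigenvalues via ARPACK's general driver
evals, evecs = eigs(M.tocsr(), k=6, which='LM')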
Cheers,

Related

How to organize list of list of lists to be compatible with scipy.optimize fmin init array

I am quite an amateur when it comes to scipy. I am trying to use scipy's fmin function on a multidimensional variable system. For the sake of simplicity I am using lists of lists of lists. My data is 12-dimensional; when I enter np.shape(DATA) it returns (3, 2, 2). I am not even sure if scipy can handle that many dimensions; if not, no problem, I can reduce them. The point is that the optimize.fmin() function doesn't accept list-based arrays as the x0 initial parameters, so I need help either rewriting the x0 array into a numpy-compatible one, or the entire DATA array into a 12-dimensional matrix, or something like that.
Here is a simpler example illustrating the issue:
from scipy import optimize
import numpy as np

def f(x):
    return x[0][0]*1.5 - x[0][1]*2.0 + x[1][0]*2.5 - x[1][1]*3.0

result = optimize.fmin(f, [[0.1, 0.1], [0.1, 0.1]])
print(result)
It will give an error saying "invalid index to scalar variable", which probably comes from fmin not understanding the [[],[]] list-of-lists structure; it probably only understands numpy array formats.
So how do I rewrite this to make it work, and also for my (3,2,2)-shaped list of lists?
scipy.optimize.fmin needs the initial guess for the function parameters to be a 1D array with a number of elements that suits the function to optimize. In your case, you can use flatten and reshape if you just need the output to be in the same shape as your input parameters. An example based on your illustration code:
from scipy import optimize
import numpy as np

def f(x):
    return x[0]*1.5 - x[1]*2.0 + x[2]*2.5 - x[3]*3.0

guess = np.array([[0.1, 0.1],
                  [0.1, 0.1]])  # guess.shape is (2, 2)
out = optimize.fmin(f, guess.flatten())  # flatten upon input
# out.shape is (4,)
# reshape output according to guess
out = out.reshape(guess.shape)  # out.shape is (2, 2) again
or out = optimize.fmin(f, guess.flatten()).reshape(guess.shape) in one line. Note that this also works for a 3-dimensional array as you propose:
guess = np.arange(12).reshape(3, 2, 2)
# array([[[ 0,  1],
#         [ 2,  3]],
#        [[ 4,  5],
#         [ 6,  7]],
#        [[ 8,  9],
#         [10, 11]]])
guess = guess.flatten()
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
guess = guess.reshape(3, 2, 2)
# array([[[ 0,  1],
#         [ 2,  3]],
#        [[ 4,  5],
#         [ 6,  7]],
#        [[ 8,  9],
#         [10, 11]]])

Creating a dataframe with vector entries

I am trying to create a pandas dataframe where the entry in a single cell is a numpy array. For example, given a list of chemical compounds - A2B3C4, D1A2J3, etc. - I create a numpy array for each of them, so that:
firstium - A2B3C4 - [2,3,4,0,0,0,0.....]
secondium - D1A2J3 - [2,0,0,1,......3....]
I would like to create a dataframe with just two columns - 'name' and 'vec' - where name is the string name of the compound and vec holds the array for the formula. Let's say that vec is of dimension 1 x 100.
Name vec
firstium [2,3,4,0,0,0...]
secondium [2,0,0,1,.....3.]
etc.
What I have been doing so far is to create a dictionary {'name': vec} and convert it to a dataframe:
Min_dict = {}
for ....:
    ..
    Min_dict[min_name] = vec
Min_Dataframe = pd.DataFrame.from_dict(Min_dict, orient='index')
However, this gives me a dataframe with as many columns as the dimension of the array, plus one. So my dataframe has dimensions data x 101; I need it to be data x 2.
This makes it inconvenient to process the data, as I would like to treat each array as one unit of information. Does anyone know how to do what I just described?
Thanks!
IIUC (if I understand correctly):
Setup
import numpy as np
import pandas as pd

data = {
    'firstium': np.array([2, 3, 4, 0, 0, 0]),
    'secondium': np.array([2, 0, 0, 1, 0, 3])
}
Option 1
pd.Series(data).rename_axis('Name').reset_index(name='Vec')

        Name                 Vec
0   firstium  [2, 3, 4, 0, 0, 0]
1  secondium  [2, 0, 0, 1, 0, 3]
Option 2
pd.DataFrame(dict(zip(('Name', 'Vec'), zip(*data.items()))))

        Name                 Vec
0   firstium  [2, 3, 4, 0, 0, 0]
1  secondium  [2, 0, 0, 1, 0, 3]
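For what it's worth, a plain constructor call expresses the same thing more directly; this is just an alternative spelling of the options above:
pd.DataFrame({'Name': list(data), 'Vec': list(data.values())})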

Split an array into bins of equal numbers

I have an array (not sorted) of N elements. I'd like to keep the original order of N, but instead of the actual elements I'd like them to have their bin numbers, where N is split into m bins of equal (if N is divisible by m) or nearly equal (if N is not divisible by m) size. I need a vectorized solution, since N is fairly large and standard Python methods won't be efficient. Is there anything in scipy or numpy that can do this?
e.g.
N = [0.2, 1.5, 0.3, 1.7, 0.5]
m = 2
Desired output: [0, 1, 0, 1, 0]
I've looked at numpy.histogram, but it doesn't give me unequally spaced bins.
Listed in this post is a NumPy-based vectorized approach. The idea is to create equally spaced rank boundaries for the length of the input array and look them up with np.searchsorted.
Here's the implementation -
def equal_bin(N, m):
    # bin boundaries in rank space
    sep = (N.size / float(m)) * np.arange(1, m + 1)
    idx = sep.searchsorted(np.arange(N.size))
    # N.argsort().argsort() gives each element's rank within N
    return idx[N.argsort().argsort()]
Sample runs with bin-counting for each bin to verify results -
In [442]: N = np.arange(1,94)
In [443]: np.bincount(equal_bin(N, 4))
Out[443]: array([24, 23, 23, 23])
In [444]: np.bincount(equal_bin(N, 5))
Out[444]: array([19, 19, 18, 19, 18])
In [445]: np.bincount(equal_bin(N, 10))
Out[445]: array([10, 9, 9, 10, 9, 9, 10, 9, 9, 9])
Here's another approach using np.linspace to create those equally spaced numbers that could be used as indices, like so -
def equal_bin_v2(N, m):
    # bin ids 0..m-1, each repeated over roughly N.size/m ranks
    idx = np.linspace(0, m, N.size, endpoint=False).astype(int)
    return idx[N.argsort().argsort()]
Sample run -
In [689]: N
Out[689]: array([ 0.2, 1.5, 0.3, 1.7, 0.5])
In [690]: equal_bin_v2(N,2)
Out[690]: array([0, 1, 0, 1, 0])
In [691]: equal_bin_v2(N,3)
Out[691]: array([0, 1, 0, 2, 1])
In [692]: equal_bin_v2(N,4)
Out[692]: array([0, 2, 0, 3, 1])
In [693]: equal_bin_v2(N,5)
Out[693]: array([0, 3, 1, 4, 2])
pandas.qcut
Another good alternative is pd.qcut from pandas. For example:
In [6]: import pandas as pd
In [7]: N = [0.2, 1.5, 0.3, 1.7, 0.5]
...: m = 2
In [8]: pd.qcut(N, m, labels=False)
Out[8]: array([0, 1, 0, 1, 0], dtype=int64)
Tip for getting the bin middle points
If you want the bin intervals rather than integer labels, leave labels at its default (None). This will allow you to get the bin middle points with:
In [26]: intervals = pd.qcut(N, 2)
In [27]: [i.mid for i in intervals]
Out[27]: [0.34950000000000003, 1.1, 0.34950000000000003, 1.1, 0.34950000000000003]
intervals is then an array of pandas.Interval objects.
See also: pd.cut, if you would like to make the bin width (not bin count) equal
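To make the contrast concrete, here is a small sketch with made-up data where the two differ: pd.cut splits the value range into equal-width bins, while pd.qcut splits by equal counts.
import pandas as pd

vals = [1, 2, 3, 4, 100]
# equal-width bins over the value range: the outlier gets a bin to itself
print(pd.cut(vals, 2, labels=False))   # [0 0 0 0 1]
# equal-count bins: each bin holds (roughly) the same number of elements
print(pd.qcut(vals, 2, labels=False))  # [0 0 0 1 1]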

Numpy: finding nonzero values along arbitrary dimension

It seems I just cannot solve this in NumPy: I have a matrix with an arbitrary number of dimensions, ordered in an arbitrary way. Inside this matrix there is always one dimension I am interested in (as I said, the position of this dimension is not always the same). Now I want to find the first nonzero value along this dimension. In fact, I need the index of that value to perform some operations on the value itself.
An example: if my matrix a is n x m x p and the dimension I am interested in is number 1, I would do something like:
for ii in range(a.shape[0]):
    for kk in range(a.shape[2]):
        myview = np.squeeze(a[ii, :, kk])
        firsti = np.nonzero(myview)[0][0]
        myview[firsti] = dostuff
Apart from performance considerations, I really do not know how to do this with a varying number of dimensions, and with the dimension I am interested in sitting at an arbitrary position.
You can abuse np.argmax for your purpose. It lets you specify the axis you are interested in, where 0 is along columns, 1 is along rows, and so on. You just need an array which contains the same value for all elements that are not zero. You can achieve that with a != 0, as this contains False (meaning 0) for all zero elements and True (meaning 1) for all non-zero elements. Now np.argmax(a != 0, axis=1) gives you the first non-zero element along axis 1.
For example:
import numpy as np

a = np.array([[0, 1, 4],
              [1, 0, 2],
              [0, 0, 1]])
print(np.argmax(a != 0, axis=0))
# >>> array([1, 0, 0]) -> along columns
print(np.argmax(a != 0, axis=1))
# >>> array([1, 0, 2]) -> along rows
This also works for higher dimensions, but the output is less instructive, as it is harder to visualize.
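One caveat worth keeping in mind: np.argmax also returns 0 for a slice that contains no nonzero value at all, which is indistinguishable from a slice whose first element is nonzero. A small sketch of how one might mask those slices out:
import numpy as np

a = np.array([[0, 1, 4],
              [0, 0, 0],   # a row with no nonzero entry at all
              [0, 0, 1]])

first = np.argmax(a != 0, axis=1)              # [1, 0, 2] -- the 0 for row 1 is bogus
has_nonzero = (a != 0).any(axis=1)             # [ True, False, True]
first_valid = np.where(has_nonzero, first, -1) # mark all-zero rows with -1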

NumPy speed up setting elements of 2D Array

I am trying to efficiently set the elements of a 2D array in Python, and I have the problem that my current approach is really slow.
This is what I tried (simplified example):
xSize = veryBigNumber
ySize = veryBigNumber
a = np.ones((xSize,ySize))
N = veryBigNumber
const = 1
for t in range(N):
    for i in range(xSize):
        for j in range(ySize):
            a[i,j] *= f(i,j)*const  # f(i,j) is an arbitrary function of i and j
Now I would like to replace the nested loops with something more efficient. How do I do this?
Assuming, as in the loop timings further down, that the entries follow a[i, j] = i + j, your 2D array could be produced using the following broadcasted addition:
np.arange(200)[:,np.newaxis] + np.arange(200)
This type of vectorised operation is likely to be very fast:
>>> %timeit np.arange(200)[:,np.newaxis] + np.arange(200)
1000 loops, best of 3: 178 µs per loop
This method is not limited to addition. We can use the two arrays in the above operation as the arguments of any universal function (commonly abbreviated to ufunc).
For example:
>>> np.multiply(np.arange(5)[:,np.newaxis], np.arange(5))
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])
NumPy has built-in ufuncs for all the basic arithmetic operations, and some more interesting ones too. If you need a more exotic function, NumPy allows you to make your own ufunc.
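As a concrete example of that last point, np.frompyfunc wraps an arbitrary Python scalar function into a ufunc that broadcasts like the built-ins (it returns object dtype, so it is a convenience rather than a speed-up). clipped_diff below is a made-up function for illustration:
import numpy as np

# made-up scalar function turned into a broadcasting ufunc
clipped_diff = np.frompyfunc(lambda i, j: max(i - j, 0), 2, 1)

out = clipped_diff(np.arange(4)[:, np.newaxis], np.arange(4)).astype(int)
# array([[0, 0, 0, 0],
#        [1, 0, 0, 0],
#        [2, 1, 0, 0],
#        [3, 2, 1, 0]])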
Edit: to quickly explain the broadcasting happening in this method, you can think of it like this...
np.arange(5) produces a 1D array which looks like this:
array([0, 1, 2, 3, 4])
The code np.arange(5)[:,np.newaxis] adds a second dimension (columns) to the range, producing this 2D array:
array([[0],
       [1],
       [2],
       [3],
       [4]])
To create the final 5x5 array using np.multiply (although we could use any ufunc or binary arithmetic operation), NumPy takes the 0 in the second array and multiplies it with each element in the first array, making a row like this:
[ 0, 0, 0, 0, 0]
It then takes the second element in the second array, 1, and multiplies it with the first array, producing this row:
[ 0, 1, 2, 3, 4]
This continues until we have the final 5x5 matrix.
You could use the np.indices routine:
b = np.indices(a.shape)
a = b[0] + b[1]
Timings:
%%timeit
b = np.indices(a.shape)
c = b[0] + b[1]
1000 loops, best of 3: 370 µs per loop

%%timeit
for i in range(200):
    for j in range(200):
        a[i,j] = i + j
100 loops, best of 3: 10.4 ms per loop
Since your output matrix a is the element-wise N-th power of a matrix F with elements f_ij = f(i,j) * const, your code simplifies to
F = np.empty((xSize, ySize))
for i in range(xSize):
    for j in range(ySize):
        F[i, j] = f(i, j) * const
a = F ** N
For even more speed you can replace the creation of the F matrix with something more efficient, given that the function f(i,j) is vectorized:
# indexing='ij' makes xmap[i, j] == i and ymap[i, j] == j
xmap, ymap = np.meshgrid(np.arange(xSize), np.arange(ySize), indexing='ij')
F = f(xmap, ymap) * const
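If f is vectorized, np.fromfunction builds the same F in one call; a minimal sketch with a made-up f for illustration:
import numpy as np

xSize, ySize, const, N = 200, 200, 1.0, 3

def f(i, j):
    # hypothetical vectorized function of the index grids
    return np.sin(i) + np.cos(j)

# np.fromfunction calls f once with full index grids of the given shape
F = np.fromfunction(f, (xSize, ySize)) * const
a = F ** N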
