Network Formation and Large Array's in Matlab Optimization - arrays

I am getting an error using repmat. My Matlab version is 2017a. "Requested 3711450x2726 (75.4GB) array exceeds maximum array size..." First, some context.
I have an adjacency matrix of social network data call it D. D is 2725x2725 with 1s denoting a link between agents i and j and 0s otherwise. I have been provided a function and sub-functions for a network formation model. There are K regressors (x variables). The model requires forming a dyad-specific regressor matrix W that is W = 0.5N(N-1) x K. In my data, this is 3711450 x K. For a start, I select only one x variable so K=1.
In the main function, there are two steps. The first step calculates the joint MLE from a logit. I have a problem in the second step computation of the variance covariance matrix with array size. Inside this step, there is a calculation that creates a 3711450 x n (2725) matrix using repmat.
INFO = ((repmat((exp_Xbeta ./ (1+exp_Xbeta).^2),1,K) .* X)'*X);
exp_Xbeta is 3711450 x K and X is a sparse 3711450 x 2725 matrix with Bytes = 178171416 of class double. The error occurs at INFO.
I've tried converting X to a tall matrix but thus far no joy. I've tried adding sparse to the INFO line but again no joy. Anyone have any ideas short of going to a cluster or getting more ram? Could I somehow convert X from a sparse matrix to a full matrix inside a datastore and then call the datastore using tall? I have not been able to figure out how to do that if it is possible.
Once INFO is constructed as an array it will be used later in one of the sub-functions. So, it needs to be callable. In case you're curious, INFO is the second derivative matrix.

I have found that producing the INFO matrix all at once was too much for my memory constraints. I split up the steps, but still, repmat and subsequent steps were a problem. Now, I've turned to building up the INFO matrix one step at a time, while never holding more than exp_Xbeta, X, and two vectors in memory. Replacing the construction of INFO with
for i = 1:d
s1_i = step1(:,1).*X(:,i);
s1_i = s1_i';
for j = 1:d;
INFO(i,j) = s1_i*X(:,j);
end
clear s1_i;
end
has dropped the memory requirement, though its slow, and things seem to be working. For anyone interested, below is a little example illustrating the point.
clear all
N = 20
n = 0.5*N*(N-1)
exp_Xbeta = rand(n,1);
X = rand(n,N);
step1 = (exp_Xbeta ./ (1+exp_Xbeta).^2);
[c,d] = size(X);
INFO = zeros(d,d);
for i = 1:d
s1_i = step1(:,1).*X(:,i)
s1_i = s1_i'
for j = 1:d
INFO(i,j) = s1_i*X(:,j)
end
clear s1_i
end
K = 1
INFO2 = ((repmat((exp_Xbeta ./ (1+exp_Xbeta).^2),1,K) .* X)'*X);
% Methods produce equivalent matrices
INFO
INFO2

Related

Matrix calculations without loops in MATLAB

I have an issue with a code performing some array operations. It is too slow, because I use loops and input data are quite big. It was the easiest way for me, but now I am looking for something faster than for loops. I was trying to optimize or rewrite code, but unsuccessful. I really aprecciate Your help.
In my code I have three arrays x1, y1 (coordinates of points in grid), g1 (values in the points) and for example their size is 300 x 300. I treat each matrix as composition of 9 and I make calculation for points in the middle one. For example I start with g1(101,101), but I am using data from g1(1:201,1:201)=g2. I need to calculate distance from each point of g1(1:201,1:201) to g1(101,101) (ll matrix), then I calculate nn as it is in the code, next I find value for g1(101,101) from nn and put it in N array. Then I go to g1(101,102) and so on until g1(200,200), where in this last case g2=g1(99:300,99:300).
As i said, this code is not very efficient, even I have to use larger arrays than I gave in the example, it takes too much time. I hope I explain enough clearly what I expect from the code. I was thinking of using arrayfun, but I have never worked with this function, so I don't know how should use it, however it seems to me it won't handle. Maybe there are other solutions, however I couldn't find anything apropriate.
tic
x1=randn(300,300);
y1=randn(300,300);
g1=randn(300,300);
m=size(g1,1);
n=size(g1,2);
w=1/3*m;
k=1/3*n;
N=zeros(w,k);
for i=w+1:2*w
for j=k+1:2*k
x=x1(i,j);
y=y1(i,j);
x2=y1(i-k:i+k,j-w:j+w);
y2=y1(i-k:i+k,j-w:j+w);
g2=g1(i-k:i+k,j-w:j+w);
ll=1./sqrt((x2-x).^2+(y2-y).^2);
ll(isinf(ll))=0;
nn=ifft2(fft2(g2).*fft2(ll));
N(i-w,j-k)=nn(w+1,k+1);
end
end
czas=toc;
For what it's worth, arrayfun() is just a wrapper for a for loop, so it wouldn't lead to any performance improvements. Also, you probably have a typo in the definition of x2, I'll assume that it depends on x1. Otherwise it would be a superfluous variable. Also, your i<->w/k, j<->k/w pairing seems inconsistent, you should check that as well. Also also, just timing with tic/toc is rarely accurate. When profiling your code, put it in a function and run the timing multiple times, and exclude the variable generation from the timing. Even better: use the built-in profiler.
Disclaimer: this solution will likely not help for your actual problem due to its huge memory need. For your input of 300x300 matrices this works with arrays of size 300x300x100x100, which is usually a no-go. Still, it's here for reference with a smaller input size. I wanted to add a solution based on nlfilter(), but your problem seems to be too convoluted to be able to use that.
As always with vectorization, you can do it faster if you can spare the memory for it. You are trying to work with matrices of size [2*k+1,2*w+1] for each [i,j] index. This calls for 4d arrays, of shape [2*k+1,2*w+1,w,k]. For each element [i,j] you have a matrix with indices [:,:,i,j] to treat together with the corresponding elements of x1 and y1. It also helps that fft2 accepts multidimensional arrays.
Here's what I mean:
tic
x1 = randn(30,30); %// smaller input for tractability
y1 = randn(30,30);
g1 = randn(30,30);
m = size(g1,1);
n = size(g1,2);
w = 1/3*m;
k = 1/3*n;
%// these will be indexed on the fly:
%//x = x1(w+1:2*w,k+1:2*k); %// size [w,k]
%//y = x1(w+1:2*w,k+1:2*k); %// size [w,k]
x2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
y2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
g2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
%// manual definition for now, maybe could be done smarter:
for ii=w+1:2*w %// don't use i and j as variables
for jj=k+1:2*k %// don't use i and j as variables
x2(:,:,ii-w,jj-k) = x1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
y2(:,:,ii-w,jj-k) = y1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
g2(:,:,ii-w,jj-k) = g1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
end
end
%// use bsxfun to operate on [2*k+1,2*w+1,w,k] vs [w,k]-sized arrays
%// need to introduce leading singletons with permute() in the latter
%// in order to have shape [1,1,w,k] compatible with the first array
ll = 1./sqrt(bsxfun(#minus,x2,permute(x1(w+1:2*w,k+1:2*k),[3,4,1,2])).^2 ...
+ bsxfun(#minus,y2,permute(y1(w+1:2*w,k+1:2*k),[3,4,1,2])).^2);
ll(isinf(ll)) = 0;
%// compute fft2, operating on [2*k+1,2*w+1,w,k]
%// will return fft2 for each index in the [w,k] subspace
nn = ifft2(fft2(g2).*fft2(ll));
%// we need nn(w+1,k+1,:,:) which is exactly of size [w,k] as needed
N = reshape(nn(w+1,k+1,:,:),[w,k]); %// quicker than squeeze()
N = real(N); %// this solution leaves an imaginary part of around 1e-12
czas=toc;

Matlab: creating 3D arrays as a function of 2D arrays

I want to create 3d arrays that are functions of 2d arrays and apply matrix operations on each of the 2D arrays. Right now I am using for loop to create a series of 2d arrays, as in the code below:
for i=1:50
F = [1 0 0; 0 i/10 0; 0 0 1];
B=F*F';
end
Is there a way to do this without the for loop? I tried things such as:
F(2,2) = 0:0.1:5;
and:
f=1:0.1:5;
F=[1 0 0; 0 f 0; 0 0 1];
to create them without the loop, but both give errors of dimension inconsistency.
I also want to perform matrix operations on F in my code, such as
B=F*F';
and want to plot certain components of F as a function of something else. Is it possible to completely eliminate the for loop in such a case?
If I understand what you want correctly, you want 50 2D matrices stacked into a 3D matrix where the middle entry varies from 1/10 to 50/10 = 5 in steps of 1/10. You almost have it correct. What you would need to do is first create a 3D matrix stack, then assign a 3D vector to the middle entry.
Something like this would do:
N = 50;
F = repmat(eye(3,3), [1 1 N]);
F(2,2,:) = (1:N)/10; %// This is 1/10 to 5 in steps of 1/10... or 0.1:0.1:5
First pre-allocate a matrix F that is the identity matrix for all slices, then replace the middle row and middle column of each slice with i/10 for i = 1, 2, ..., 50.
Therefore, to get the ith slice, simply do:
out = F(:,:,i);
Minor Note
I noticed that what you want to do in the end is a matrix multiplication of the 3D matrices. That operation is not defined in MATLAB nor anywhere in a linear algebra context. If you want to multiply each 2D slice independently, you'd be better off using a for loop. Doing this vectorized with native operations isn't supported in this context.
To do it in a loop, you'd do something like this for each slice:
B = zeros(size(F));
for ii = 1 : size(B,3)
B(:,:,ii) = F(:,:,ii)*F(:,:,ii).';
end
... however, examining the properties of your matrix, the only thing that varies is the middle entry. If you perform a matrix multiplication, all of the entries per slice are going to be the same... except for the middle, where the entry is simply itself squared. It doesn't matter if you multiple one slice by the transpose of the other. The transpose of the identity is still the identity.
If your matrices are going to be like this, you can just perform an element-wise multiplication with itself:
B = F.*F;
This will not work if F is anything else but what you have above.
Creating the matrix would be easy:
N = 50;
S = cell(1,N);
S(:) = {eye(3,3)};
F = cat(3, S{:});
F(2,2,:) = (1:N)/10;
Another (faster) way would be:
N = 50;
F = zeros(3,3,N);
F(1,1,:) = 1;
F(2,2,:) = (1:N)/10;
F(3,3,:) = 1;
You then can get the 3rd matrix (for example) by:
F(:,:,3)

MATLAB to C-code

I am following MathWorks guide to converting MATLAB code to C-code.
The first step is to enter
%#codegen
after every function that I want converted to C-code, however doing so has given me the following prompt on the code below.
function lanes=find_lanes(B,h, stats)
% Find the regions that look like lanes
%#codegen
lanes = {};
l=0;
for k = 1:length(B)
metric = stats(k).MajorAxisLength/stats(k).MinorAxisLength;
%testlane(k);
%end
%function testlane(k)
coder.inline('never');
if metric > 5 & all(B{k}(:,1)>100)
l=l+1;
lanes(l,:)=B(k);
else
delete(h(k))
end
end
end
around the curly braces:
code generation only supports cell operations for "varargin" and
"varargout"
Another prompt says
Code generation does not support variable "lanes" size growth through indexing
where lanes is mentioned for the second time.
The input Arguments for the function are:
B - Is the output of the bwboundaries Image Processing toolbox function. It is a P-by-1 cell array, where P is the number of objects and holes. Each cell in the cell array contains a Q-by-2 matrix. Each row in the matrix contains the row and column coordinates of a boundary pixel. Q is the number of boundary pixels for the corresponding region.
h - plots the boundaries of the objects with a green outline while being a matrix of size 1 X length(B), holding the values of the boundaries like so like so:
h(K)=plot(boundary(:,2), boundary(:,1), 'g', 'LineWidth', 2);//boundary(:,1) - Y coordinate, boundary(:,2) - X coordinate.
stats - 19x1 struct array acquired using the regionprops function from the Image Processing toolbox with fields:
MajorAxisLength and
MinorAxisLength (of the object)
I would really appreciate any input you can give in helping me clear this error. Thanks in Advance!
Few points about your code generation -
Only a subset of functions in MATLAB and Image Processing Toolbox support code generation - Image Processing Toolbox support for code generation.
Cell arrays do not support code generation yet - Cell array support.
In your code, it seems like your variable is growing i.e. the initial size of the array is not able to support your workflow. You should follow code generation for variable sized inputs.
I had a similar error i.e. code generation does not support variable size growth through indexing. Inside my for loop I had a statement as such which had the same error:
y(i) = k;
I introduced a temporary storage variable u and modified my code to:
u = y;
u(i) = k;
y = u;
I suggest you do the same for your variable lanes.

store data in array from loop in matlab

I want to store data coming from for-loops in an array. How can I do that?
sample output:
for x=1:100
for y=1:100
Diff(x,y) = B(x,y)-C(x,y);
if (Diff(x,y) ~= 0)
% I want to store these values of coordinates in array
% and find x-max,x-min,y-max,y-min
fprintf('(%d,%d)\n',x,y);
end
end
end
Can anybody please tell me how can i do that. Thanks
Marry
So you want lists of the x and y (or row and column) coordinates at which B and C are different. I assume B and C are matrices. First, you should vectorize your code to get rid of the loops, and second, use the find() function:
Diff = B - C; % vectorized, loops over indices automatically
[list_x, list_y] = find(Diff~=0);
% finds the row and column indices at which Diff~=0 is true
Or, even shorter,
[list_x, list_y] = find(B~=C);
Remember that the first index in matlab is the row of the matrix, and the second index is the column; if you tried to visualize your matrices B or C or Diff by using imagesc, say, what you're calling the X coordinate would actually be displayed in the vertical direction, and what you're calling the Y coordinate would be displayed in the horizontal direction. To be a little more clear, you could say instead
[list_rows, list_cols] = find(B~=C);
To then find the maximum and minimum, use
maxrow = max(list_rows);
minrow = min(list_rows);
and likewise for list_cols.
If B(x,y) and C(x,y) are functions that accept matrix input, then instead of the double-for loop you can do
[x,y] = meshgrid(1:100);
Diff = B(x,y)-C(x,y);
mins = min(Diff);
maxs = max(Diff);
min_x = mins(1); min_y = mins(2);
max_x = maxs(1); max_y = maxs(2);
If B and C are just matrices holding data, then you can do
Diff = B-C;
But really, I need more detail before I can answer this completely.
So: are B and C functions, matrices? You want to find min_x, max_x, but in the example you give that's just 1 and 100, respectively, so...what do you mean?

Eigenvector computation using OpenCV

I have this matrix A, representing similarities of pixel intensities of an image. For example: Consider a 10 x 10 image. Matrix A in this case would be of dimension 100 x 100, and element A(i,j) would have a value in the range 0 to 1, representing the similarity of pixel i to j in terms of intensity.
I am using OpenCV for image processing and the development environment is C on Linux.
Objective is to compute the Eigenvectors of matrix A and I have used the following approach:
static CvMat mat, *eigenVec, *eigenVal;
static double A[100][100]={}, Ain1D[10000]={};
int cnt=0;
//Converting matrix A into a one dimensional array
//Reason: That is how cvMat requires it
for(i = 0;i < affnDim;i++){
for(j = 0;j < affnDim;j++){
Ain1D[cnt++] = A[i][j];
}
}
mat = cvMat(100, 100, CV_32FC1, Ain1D);
cvEigenVV(&mat, eigenVec, eigenVal, 1e-300);
for(i=0;i < 100;i++){
val1 = cvmGet(eigenVal,i,0); //Fetching Eigen Value
for(j=0;j < 100;j++){
matX[i][j] = cvmGet(eigenVec,i,j); //Fetching each component of Eigenvector i
}
}
Problem: After execution I get nearly all components of all the Eigenvectors to be zero. I tried different images and also tried populating A with random values between 0 and 1, but the same result.
Few of the top eigenvalues returned look like the following:
9805401476911479666115491135488.000000
-9805401476911479666115491135488.000000
-89222871725331592641813413888.000000
89222862280598626902522986496.000000
5255391142666987110400.000000
I am now thinking on the lines of using cvSVD() which performs singular value decomposition of real floating-point matrix and might yield me the eigenvectors. But before that I thought of asking it here. Is there anything absurd in my current approach? Am I using the right API i.e. cvEigenVV() for the right input matrix (my matrix A is a floating point matrix)?
cheers
Note to readers: This post at first may seem unrelated to the topic, but please refer to the discussion in the comments above.
The following is my attempt at implementing the Spectral Clustering algorithm applied to image pixels in MATLAB. I followed exactly the paper mentioned by #Andriyev:
Andrew Ng, Michael Jordan, and Yair Weiss (2002).
On spectral clustering: analysis and an algorithm.
In T. Dietterich, S. Becker, and Z. Ghahramani (Eds.),
Advances in Neural Information Processing Systems 14. MIT Press
The code:
%# parameters to tune
SIGMA = 2e-3; %# controls Gaussian kernel width
NUM_CLUSTERS = 4; %# specify number of clusters
%% Loading and preparing a sample image
%# read RGB image, and make it smaller for fast processing
I0 = im2double(imread('house.png'));
I0 = imresize(I0, 0.1);
[r,c,~] = size(I0);
%# reshape into one row per-pixel: r*c-by-3
%# (with pixels traversed in columwise-order)
I = reshape(I0, [r*c 3]);
%% 1) Compute affinity matrix
%# for each pair of pixels, apply a Gaussian kernel
%# to obtain a measure of similarity
A = exp(-SIGMA * squareform(pdist(I,'euclidean')).^2);
%# and we plot the matrix obtained
imagesc(A)
axis xy; colorbar; colormap(hot)
%% 2) Compute the Laplacian matrix L
D = diag( 1 ./ sqrt(sum(A,2)) );
L = D*A*D;
%% 3) perform an eigen decomposition of the laplacian marix L
[V,d] = eig(L);
%# Sort the eigenvalues and the eigenvectors in descending order.
[d,order] = sort(real(diag(d)), 'descend');
V = V(:,order);
%# kepp only the largest k eigenvectors
%# In this case 4 vectors are enough to explain 99.999% of the variance
NUM_VECTORS = sum(cumsum(d)./sum(d) < 0.99999) + 1;
V = V(:, 1:NUM_VECTORS);
%% 4) renormalize rows of V to unit length
VV = bsxfun(#rdivide, V, sqrt(sum(V.^2,2)));
%% 5) cluster rows of VV using K-Means
opts = statset('MaxIter',100, 'Display','iter');
[clustIDX,clusters] = kmeans(VV, NUM_CLUSTERS, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton');
%% 6) assign pixels to cluster and show the results
%# assign for each pixel the color of the cluster it belongs to
clr = lines(NUM_CLUSTERS);
J = reshape(clr(clustIDX,:), [r c 3]);
%# show results
figure('Name',sprintf('Clustering into K=%d clusters',NUM_CLUSTERS))
subplot(121), imshow(I0), title('original image')
subplot(122), imshow(J), title({'clustered pixels' '(color-coded classes)'})
... and using a simple house image I drew in Paint, the results were:
and by the way, the first 4 eigenvalues used were:
1.0000
0.0014
0.0004
0.0002
and the corresponding eigenvectors [columns of length r*c=400]:
-0.0500 0.0572 -0.0112 -0.0200
-0.0500 0.0553 0.0275 0.0135
-0.0500 0.0560 0.0130 0.0009
-0.0500 0.0572 -0.0122 -0.0209
-0.0500 0.0570 -0.0101 -0.0191
-0.0500 0.0562 -0.0094 -0.0184
......
Note that there are step performed above which you didn't mention in your question (Laplacian matrix, and normalizing its rows)
I would recommend this article. The author implements Eigenfaces for face recognition. On page 4 you can see that he uses cvCalcEigenObjects to generate the eigenvectors from an image. In the article the whole pre processing step necessary for this computations are shown.
Here's a not very helpful answer:
What does theory (or maths scribbled on a piece of paper) tell you the eigenvectors ought to be ? Approximately.
What does another library tell you the eigenvectors ought to be ? Ideally what does a system such as Mathematica or Maple (which can be persuaded to compute to arbitrary precision) tell you the eigenvectors ought to be ? If not for a production-sixed problem at least for a test-sized problem.
I'm not an expert with image processing so I can't be much more helpful, but I spend a lot of time with scientists and experience has taught me that a lot of tears and anger can be avoided by doing some maths first and forming an expectation of what results you ought to get before wondering why you got 0s all over the place. Sure it might be an error in the implementation of an algorithm, it might be loss of precision or some other numerical problem. But you don't know and shouldn't follow up those lines of inquiry yet.
Regards
Mark

Resources