Optimizing matrix calculations in for loops in Octave

I imported code from Matlab to Octave and the speed of certain functions seems to have dropped.
I looked into vectorization and could not come up with a solution with my limited knowledge.
What I want to ask is: is there a way to speed this up?
n = 181;
N = 250;
for i=1:n
  for j=1:n
    par=0;
    for k=1:N
      par=par+log2(1+(10.^(matrix1(j,i,matrix2(j,i))./10)./(matrix3(j,i).*double1+double2)));
    end
    resultingMatrix(j,i)=2.^((1/N).*par)-1;
  end
end
Where dimensions are:
matrix1 = 181x181x2,
matrix2 = 181x181 --> containing values either 1 or 2 only,
matrix3 = 181x181,
double1, double2 = just doubles

Here's my testing code; I completed your code by creating some random matrices:
n = 181;
N = 250;
matrix1 = rand(n,n,2);
matrix2 = randi(2,n,n);
matrix3 = rand(n,n);
double1 = 1;
double2 = 1;
tic
for i=1:n
  for j=1:n
    par=0;
    for k=1:N
      par=par+log2(1+(10.^(matrix1(j,i,matrix2(j,i))./10)./(matrix3(j,i).*double1+double2)));
    end
    resultingMatrix(j,i)=2.^((1/N).*par)-1;
  end
end
toc
Note that the code inside the loop over k doesn't use k, which makes the loop superfluous: it does the same computation 250 times and adds up the results, and the subsequent division by 250 (the 1/N factor) simply recovers the value of a single iteration. We can easily remove it.
Another important thing to do is preallocate resultingMatrix, to avoid it growing with every loop iteration.
This is the resulting code:
tic
resultingMatrix2 = zeros(n,n);
for i=1:n
  for j=1:n
    par=log2(1+(10.^(matrix1(j,i,matrix2(j,i))./10)./(matrix3(j,i).*double1+double2)));
    resultingMatrix2(j,i)=2.^par-1;
  end
end
toc
max(abs((resultingMatrix(:)-resultingMatrix2(:))./resultingMatrix(:)))
The last line computes the maximum relative difference. It is 9.9424e-15 in my version of Octave. It will differ depending on the version, the system, and more. This error is the floating-point rounding error. Note that the original code, adding the same value 250 times, and then dividing it by 250, will produce a larger rounding error than the modified code. For example,
x = pi;
t = 0;
for i = 1:N
  t = t + x;
end
t = t / N;
t-x
gives -8.4377e-15, a similar rounding error to what we saw above.
The original code took 81.5 s, the modified code takes only 0.4 s. This is not a gain of vectorization, it is just a gain of preallocation and not needlessly repeating the same computation over and over again.
Next, we can remove the other two loops by vectorizing the operations. The difficult bit here is matrix1(j,i,matrix2(j,i)). We can produce each of the n*n linear indices with (1:n*n).' + (matrix2(:)-1)*(n*n). This is not trivial, I suggest you think about how this computation works. You need to know that linear indices count, starting at 1 for the top-left array element, first down, then right, then along the 3rd dimension. So 1:n*n is simply the linear indices for each of the elements of a 2D array, in order. To each of these we add n*n if we need to access the 2nd element along the 3rd dimension.
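To make the index construction concrete, here is a minimal check on a small hypothetical array (the sizes and variable names here are chosen just for illustration):
% Minimal illustration of picking one "page" per element via linear indices
n = 3;
A = rand(n, n, 2);
page = randi(2, n, n);                               % 1 or 2, like matrix2
idx = reshape((1:n*n).' + (page(:)-1)*(n*n), n, n);  % linear indices into A
B = A(idx);                                          % B(j,i) == A(j,i,page(j,i))
ok = true;
for i = 1:n
  for j = 1:n
    ok = ok && (B(j,i) == A(j,i,page(j,i)));         % element-wise check
  end
end
disp(ok)                                             % prints 1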
We now have the code
tic
index = reshape((1:n*n).' + (matrix2(:)-1)*(n*n), n, n);
par = log2(1+(10.^(matrix1(index)./10)./(matrix3.*double1+double2)));
resultingMatrix3 = 2.^par-1;
toc
max(abs((resultingMatrix(:)-resultingMatrix3(:))./resultingMatrix(:)))
This code produces the exact same result as my previous version, and runs in only 0.013 s, 30 times faster than the non-vectorized code, and 6000 times faster than the original code.


Running Time of an Algorithm

Sorry, this is a three-part question. I keep trying to get the first part, and I think that if I get that, the rest will fall into place, but my running time isn't quite right. I understand that there are n iterations, but not how to calculate the inner loop's number of iterations without using the value j.
Consider the following basic problem. You're given an array A consisting of n integers A[1], A[2], ..., A[n]. You'd like to output a two-dimensional n-by-n array B in which B[i,j] (for i < j) contains the sum of array entries A[i] through A[j], that is, the sum A[i] + A[i+1] + ... + A[j]. (The value of array entry B[i,j] is left unspecified whenever i >= j, so it doesn't matter what is output for these values.)
Here’s a simple algorithm to solve this problem.
For i = 1, 2, ..., n
  For j = i+1, i+2, ..., n
    Add up array entries A[i] through A[j]
    Store the result in B[i,j]
  Endfor
Endfor
(a) For some function f that you should choose, give a bound of the form O(f(n)) on the running time of this algorithm on an input of size n (i.e., a bound on the number of operations performed by the algorithm).
(b) For this same function f, show that the running time of the algorithm on an input of size n is also Omega(f(n)). (This shows an asymptotically tight bound of Theta(f(n)) on the running time.)
(c) Although the algorithm you analyzed in parts (a) and (b) is the most
natural way to solve the problem--after all, it just iterates through the relevant entries of the array B, filling in a value for each--it contains some highly unnecessary sources of inefficiency. Give a different algorithm to solve this problem, with an asymptotically better running time. In other words, you should design an algorithm with running time O(g(n)), where lim_{n->infinity} g(n)/f(n) = 0.
I will go through the first part. Then, it will probably be clear enough for you to solve the second part. The third one is a completely independent question (and should therefore also be posted as a separate question if you need help with it).
To analyze the running time, we can start from the inside and gradually work outwards.
Add up array entries A[i] through A[j]
Assuming the straightforward implementation where you loop over the entries, this gives a running time of j - i + 1 (abstract time units). The exact number will depend on the implementation and how you count operations. For the O(*) notation, this makes no difference. I will keep these specific times and not simplify them to some O(*) notation, since you will probably need the specific times for part (b).
Store the result in B[i,j]
This has a running time of 1. Hence, the part inside the inner loop has a running time of j - i + 2. I will substitute the code with T(j - i + 2) onwards. So, the code we have left is:
For i = 1, 2, ..., n
  For j = i+1, i+2, ..., n
    T(j - i + 2)
  Endfor
Endfor
To find the running time of the inner loop, we need to solve the sum over the given bounds: Sum (for j from i+1 to n) (j - i + 2). It is an arithmetic series with the solution 1/2 * (i - n - 5) * (i - n). The code is now:
For i = 1, 2, ..., n
  T(1/2 * (i - n - 5) * (i - n))
Endfor
Again solving the sum gives us the final running time of 1/6 * (n^3 + 6n^2 - 7n). And this function is in O(n^3).
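A quick brute-force check of that closed form for small n, as a minimal MATLAB/Octave sketch:
% Count the abstract operations exactly as in the analysis above
% and compare against the closed form 1/6*(n^3 + 6*n^2 - 7*n).
for n = 1:6
  ops = 0;
  for i = 1:n
    for j = i+1:n
      ops = ops + (j - i + 2);   % cost of one inner-loop body
    end
  end
  fprintf('n=%d  counted=%d  formula=%g\n', n, ops, (n^3 + 6*n^2 - 7*n)/6);
end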
Since summing A[i] through A[j] costs j - i + 1, plus 1 for storing the value in B[i,j], f(n) = \sum_{i=1}^n \sum_{j=i+1}^n ((j - i + 1) + 1). With the change of variable k = j - i + 1, and up to lower-order terms, f(n) ≈ \sum_{i=1}^n \sum_{k=1}^{n-i+1} k = \sum_{i=1}^n (n - i + 1)(n - i + 2)/2. With the further change of variable h = n - i, this is \sum_{h=0}^{n-1} (h + 1)(h + 2)/2, which is Theta(n^3). Hence the algorithm is O(n^3).
For the third part, you can use the fact that B[i,j] = B[i,j-1] + A[j]. This lets you reuse the previously computed sum instead of recomputing it, which reduces the cost of the inner-loop body from j - i + 1 to 1, so g(n) = Theta(n^2) instead of Theta(n^3).
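For illustration, here is a minimal MATLAB-style sketch of that O(n^2) idea (the input A and all names here are just an example):
% O(n^2) version: reuse B(i,j-1) instead of re-summing A(i:j) from scratch
A = randi(10, 1, 6);        % example input
n = numel(A);
B = zeros(n, n);            % B(i,j) is only meaningful for j > i
for i = 1:n
  running = A(i);
  for j = i+1:n
    running = running + A(j);   % B(i,j) = B(i,j-1) + A(j)
    B(i,j) = running;
  end
end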

MATLAB: Improving for-loop

I need to multiply parts of a column vector with a fixed row vector. I solved this problem using a for-loop. However, I am wondering if the performance can be improved as I have to perform this kind of computation around 50 million times. Here's my code so far:
multMat = 1:5;
mat = randi(5,10,1);
windowSize = 5;
vout = nan(10,1);
for r = windowSize : 10
  vout(r) = multMat * mat( (r - windowSize + 1) : r);
end
I was thinking about using arrayfun. However, first, I don't know how to address the cell range (i.e. the previous five cells including the current cell), and second, I am not sure if arrayfun will be any faster than using the loop.
This sliding vector multiplication you're describing is an example of what is known as convolution. The following produces the same result as the loop in your example:
vout = [nan(windowSize-1,1);
        conv(mat,flip(multMat),'valid')];
If your output doesn't really need the leading NaN values which aren't overwritten in your loop then the conv expression is sufficient without concatenating the NaN elements to it.
For sufficiently large vectors this is of course not guaranteed to be as fast as you'd like it to be, but MATLAB's built-in convolution implementation is likely to be pretty close to an optimal tool for the job.
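For reference, a minimal sanity check (with random data as in the question) that the conv-based version matches the loop; multMat(:) is used here only to force column orientation:
multMat = 1:5;
mat = randi(5,10,1);
windowSize = 5;
% loop version from the question
vout_loop = nan(10,1);
for r = windowSize:10
  vout_loop(r) = multMat * mat((r - windowSize + 1):r);
end
% convolution version (both conv inputs made columns for a column result)
vout_conv = [nan(windowSize-1,1); conv(mat, flip(multMat(:)), 'valid')];
isequal(vout_loop(windowSize:end), vout_conv(windowSize:end))   % returns 1 (true)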

What's the time complexity of deleting the last element of an array in MATLAB?

I don't know what data structure MATLAB uses for arrays. Does it use a FIFO?
I tried deleting the last element from a column vector and from a row vector; the time it takes depends on the size N.
How can I delete the last element in O(1), essentially a pop()?
Following the timing script from this question, which deals with the opposite problem (growing a vector), I tried both approaches to remove the last element:
building_array = building_array(1:end-1);
building_array(end) = [];
As Tommaso noted, the former (blue in my timing plot) is faster than the latter (red).
My guess as to why these two forms have different timings is that MATLAB's JIT (Just in Time Compiler) is more optimized for the one syntax than the other. There is no technical reason for one being faster than the other.
I am really surprised that the cost is linear in the number of elements, very different from the behavior of adding one element at a time.
Test code (modified from Peter Barrett Bryan):
num_averages = 500;
num_sims = 10000;
time_store = nan(num_sims, num_averages);
for i = 1:num_averages
  building_array = rand(num_sims,1);
  for j = 1:num_sims
    tic;
    building_array = building_array(1:end-1);
    time_store(j, i) = toc;
  end
end
After a quick test, I found out that an in-place reassignment is much faster than an element deletion. But the performance of both operations still depends on the vector size... I simply think this cannot be achieved in O(1) as you may wish, due to how Matlab internally handles memory.
First approach:
A = rand(100,1);
tic();
A(end) = [];
toc(); % Average elapsed time: 0.000015 seconds
B = rand(10000,1);
tic();
B(end) = [];
toc(); % Average elapsed time: 0.000061 seconds
Second approach:
A = rand(100,1);
tic();
A = A(1:end-1);
toc(); % Average elapsed time: 0.000007 seconds
B = rand(10000,1);
tic();
B = B(1:end-1);
toc(); % Average elapsed time: 0.000017 seconds
My knowledge of Matlab is not deep enough to allow me to explain exactly what happens under the hood and why there is such a big difference between the two approaches. But I can try a guess.
In the first approach, Matlab has to:
evaluate end to a real vector offset;
find out how many elements have to be removed;
allocate a new array of size total_elements - removed_elements;
take the untouched portions of the array and buffer copy them into the new array;
replace the previous array reference with the new one;
deallocate the previous array.
In the second one, Matlab has to:
evaluate end to a real vector offset;
allocate a new array of size indexed_elements;
take the desired ranges of the array and buffer copy them into the new array;
replace the previous array reference with the new one;
deallocate the previous array.
And yet, we are far from O(1).
DISCLAIMER: This is not a real suggestion, it is merely a stupid way to actually obtain an O(1) run-time
Right, with the disclaimer given, it is actually possible to make an O(1) pop command in "Matlab". The solution is to not do it in Matlab, but in Python. Confused?
Basically, you convert the vector to a Python list with py.list(), whereafter the O(1) pop command is possible to execute. Thus you can do something like:
a = randn(1,1e4);
li=py.list(a);
b = li.pop;
However, as you might have guessed, typecasting and running Python through Matlab is not exactly what I would call fast. So even though we can maintain a constant run-time, that constant is simply too large for it to be of any use.
In the figure, blue is the Matlab/Python solution, whereas red(-ish) is the best solution, as given by Tommaso and Cris.
As is clear, we maintain what looks like O(1), but at a cost.
Code for reference:
num_averages = 100;
num_sims = 10000;
time_store = nan(num_sims, num_averages);
for i = 1:num_averages
  building_array = rand(1,num_sims);
  li = py.list(building_array);
  for j = 1:num_sims
    tic;
    li.pop;
    time_store(j, i) = toc;
  end
end
EDIT: The size at which this approach becomes faster than the pure Matlab solution is actually within a reasonable range, around 150000 elements.
Note: The fluctuations in this figure are larger as my patience ran out and thus I reduced the number of averages to 5.
Make no mistake, this solution is still stupid and should not be used. The conversion is linear in size, and converting back again destroys the gain even more. Thus the only case where this solution is actually better is if you are solely interested in the popped element, and even then a for-loop with indexing does the same thing a lot faster.
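For instance, a minimal sketch of that indexing idea: read the last element and track a logical length instead of resizing the array (the variable names here are my own):
A = rand(1, 1e4);
len = numel(A);          % current logical length
popped = A(len);         % read the last element in O(1)
len = len - 1;           % "remove" it without touching the array
% A(1:len) gives the remaining data whenever it is actually needed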
It depends on the array length. Use tic <code under test> toc to measure the time:
A = ones(1,50);
B = ones(1,10);
tic
A(1) = [];
toc
tic
A(49) = [];
toc
tic
B(1) = [];
toc
tic
B(9) = [];
toc
Elapsed time is 0.000195 seconds.
Elapsed time is 0.000085 seconds.
Elapsed time is 0.000061 seconds.
Elapsed time is 0.000051 seconds.
You can see that it's faster to delete the last element of the array than the first.

How to average independent consecutive blocks of an array as fast as possible?

Here is the problem:
data = 1:0.5:(8E6+0.5);
An array of 16 million points needs to be averaged every 10,000 elements.
Like this:
x = mean(data(1:10000))
But repeated N times, where N depends on the number of elements we average over:
range = 10000;
N = ceil(numel(data)/range);
My current method is this:
data(1) = mean(data(1,1:range));
for i = 2:N
  data(i) = mean(data(1,range*(i-1)+1:range*i));
end
How can the speed be improved?
N.B: We need to overwrite the original array of data (essentially bin the data and average it)
data = 1:0.5:(8E6-0.5); % Your data, actually 16M-2 elements
N = 1e4; % Amount to average over
tmp = mod(numel(data),N); % find out whether it fits
data = [data nan(1,N-tmp)]; % add NaN if necessary
data2=reshape(data,N,[]); % reshape into a matrix
out = nanmean(data2,1); % get average over the rows, ignoring NaN
Visual confirmation that it works using plot(out)
Note that technically you can't do what you want if mod(numel(data),N) is not equal to 0, since then you'd have a remainder. I elected to average over everything in there, although ignoring the remainder is also an option.
If you're sure mod(numel(data),N) is zero every time, you can leave all that out and reshape directly. I'd not recommend using this though, because if your mod is not 0, this will error out on the reshape:
data = 1:0.5:(8E6+0.5); % 16M elements now
N = 1e4; % Amount to average over
out = sum(reshape(data,N,[]),1)./N; % alternative
This is a bit wasteful, but you can use movmean (which will handle the endpoints the way you want it to) and then subsample the output:
y = movmean(x, [0 9999]);
y = y(1:10000:end);
Even though this is wasteful (you're computing a lot of elements you don't need), it appears to outperform the nanmean approach (at least on my machine).
There's also the option to just compensate for the extra elements you added:
x = 1:0.5:(8E6-0.5);
K = 1e4;
Npad = ceil(length(x)/K)*K - length(x);
x((end+1):(end+Npad)) = 0;
y = mean(reshape(x, K, []));
y(end) = y(end) * K/(K - Npad);
Reshape the data array into a 10000-by-N matrix, then compute the mean of each column using the mean function.
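A minimal sketch of that suggestion, assuming numel(data) is an exact multiple of the block size:
blockSize = 10000;                              % elements per block
out = mean(reshape(data, blockSize, []), 1);    % one average per 10000-element block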

Matrix calculations without loops in MATLAB

I have an issue with code performing some array operations. It is too slow because I use loops and the input data are quite big. Loops were the easiest way for me, but now I am looking for something faster. I was trying to optimize or rewrite the code, but without success. I would really appreciate your help.
In my code I have three arrays: x1 and y1 (coordinates of points in a grid) and g1 (values at those points); for example, their size is 300 x 300. I treat each matrix as a composition of 9 blocks and make the calculation for points in the middle one. For example, I start with g1(101,101), but I use data from g1(1:201,1:201) = g2. I need to calculate the distance from each point of g1(1:201,1:201) to g1(101,101) (the ll matrix), then I calculate nn as in the code, next I find the value for g1(101,101) from nn and put it in the N array. Then I go to g1(101,102) and so on until g1(200,200), where in this last case g2 = g1(99:300,99:300).
As I said, this code is not very efficient, and since I have to use even larger arrays than in the example, it takes too much time. I hope I have explained clearly enough what I expect from the code. I was thinking of using arrayfun, but I have never worked with that function, so I don't know how to use it, and it seems to me it won't help. Maybe there are other solutions, but I couldn't find anything appropriate.
tic
x1=randn(300,300);
y1=randn(300,300);
g1=randn(300,300);
m=size(g1,1);
n=size(g1,2);
w=1/3*m;
k=1/3*n;
N=zeros(w,k);
for i=w+1:2*w
  for j=k+1:2*k
    x=x1(i,j);
    y=y1(i,j);
    x2=y1(i-k:i+k,j-w:j+w);
    y2=y1(i-k:i+k,j-w:j+w);
    g2=g1(i-k:i+k,j-w:j+w);
    ll=1./sqrt((x2-x).^2+(y2-y).^2);
    ll(isinf(ll))=0;
    nn=ifft2(fft2(g2).*fft2(ll));
    N(i-w,j-k)=nn(w+1,k+1);
  end
end
czas=toc;
For what it's worth, arrayfun() is just a wrapper for a for loop, so it wouldn't lead to any performance improvements. Also, you probably have a typo in the definition of x2, I'll assume that it depends on x1. Otherwise it would be a superfluous variable. Also, your i<->w/k, j<->k/w pairing seems inconsistent, you should check that as well. Also also, just timing with tic/toc is rarely accurate. When profiling your code, put it in a function and run the timing multiple times, and exclude the variable generation from the timing. Even better: use the built-in profiler.
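For example, a sketch of the kind of timing harness meant here, using timeit (myFilterLoop is a hypothetical function wrapping the loop code from the question):
% Hypothetical timing harness: put the loop code in a function and time it
% with timeit, excluding the data generation from the measurement.
x1 = randn(300,300);
y1 = randn(300,300);
g1 = randn(300,300);
f = @() myFilterLoop(x1, y1, g1);   % myFilterLoop = the loop code as a function
t = timeit(f);                      % runs f several times, returns a representative time
fprintf('runtime: %.3f s\n', t)
% or: profile on; myFilterLoop(x1, y1, g1); profile viewer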
Disclaimer: this solution will likely not help for your actual problem due to its huge memory need. For your input of 300x300 matrices this works with arrays of size 300x300x100x100, which is usually a no-go. Still, it's here for reference with a smaller input size. I wanted to add a solution based on nlfilter(), but your problem seems to be too convoluted to be able to use that.
As always with vectorization, you can do it faster if you can spare the memory for it. You are trying to work with matrices of size [2*k+1,2*w+1] for each [i,j] index. This calls for 4d arrays, of shape [2*k+1,2*w+1,w,k]. For each element [i,j] you have a matrix with indices [:,:,i,j] to treat together with the corresponding elements of x1 and y1. It also helps that fft2 accepts multidimensional arrays.
Here's what I mean:
tic
x1 = randn(30,30); %// smaller input for tractability
y1 = randn(30,30);
g1 = randn(30,30);
m = size(g1,1);
n = size(g1,2);
w = 1/3*m;
k = 1/3*n;
%// these will be indexed on the fly:
%//x = x1(w+1:2*w,k+1:2*k); %// size [w,k]
%//y = x1(w+1:2*w,k+1:2*k); %// size [w,k]
x2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
y2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
g2 = zeros(2*k+1,2*w+1,w,k); %// size [2*k+1,2*w+1,w,k]
%// manual definition for now, maybe could be done smarter:
for ii=w+1:2*w %// don't use i and j as variables
  for jj=k+1:2*k %// don't use i and j as variables
    x2(:,:,ii-w,jj-k) = x1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
    y2(:,:,ii-w,jj-k) = y1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
    g2(:,:,ii-w,jj-k) = g1(ii-k:ii+k,jj-w:jj+w); %// check w vs k here
  end
end
%// use bsxfun to operate on [2*k+1,2*w+1,w,k] vs [w,k]-sized arrays
%// need to introduce leading singletons with permute() in the latter
%// in order to have shape [1,1,w,k] compatible with the first array
ll = 1./sqrt(bsxfun(@minus,x2,permute(x1(w+1:2*w,k+1:2*k),[3,4,1,2])).^2 ...
           + bsxfun(@minus,y2,permute(y1(w+1:2*w,k+1:2*k),[3,4,1,2])).^2);
ll(isinf(ll)) = 0;
%// compute fft2, operating on [2*k+1,2*w+1,w,k]
%// will return fft2 for each index in the [w,k] subspace
nn = ifft2(fft2(g2).*fft2(ll));
%// we need nn(w+1,k+1,:,:) which is exactly of size [w,k] as needed
N = reshape(nn(w+1,k+1,:,:),[w,k]); %// quicker than squeeze()
N = real(N); %// this solution leaves an imaginary part of around 1e-12
czas=toc;
