Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating Euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically fewer than 1000 pairs). Within the grander scheme of the program I am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be Euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The Statistics Toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, I would like to avoid using functions from the Statistics Toolbox (I am not certain the user of the function will have access to it).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weight difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i = 1:numA
    for j = 1:numB
        differences = abs(A(i,:) - B(j,:)).^r; % raised to r, matching pairdist1
        differences = differences.*wts;
        differences = differences.^(1/r);
        D(i,j) = sum(differences,2);
    end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
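As a quick sanity check (my addition, not part of the original timing script), the two implementations can be compared on the same data:
% D1 and D2 should match to within floating-point round-off
assert(max(abs(D1(:) - D2(:))) < 1e-12);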
It's clear that the repmat-and-permute version runs much more quickly than the double-for-loop version, at least for smaller datasets. But I also know that calls to repmat are often slow. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing Euclidean distance [link], so I use that method in the euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it's much faster than either of the approaches I laid out earlier in this post, and it easily accepts weightings.
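For reference, the standard vectorized Euclidean trick folds the weights into the data and then expands the squared norm. Here is a minimal sketch (my own, assuming the conventional weighted Euclidean definition; it is not necessarily the exact method linked above):
% weighted Euclidean: D(i,j)^2 = sum_k wts(k)*(A(i,k)-B(j,k))^2
Aw = bsxfun(@times, A, sqrt(wts));   % scale each dimension by the square root of its weight
Bw = bsxfun(@times, B, sqrt(wts));
Dsq = bsxfun(@plus, sum(Aw.^2,2), sum(Bw.^2,2)') - 2*(Aw*Bw');
D = sqrt(max(Dsq, 0));               % clamp tiny negatives caused by round-off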

You can replace repmat by bsxfun. Doing so avoids explicit repetition, so it's more memory-efficient, and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(@times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end

For r = 1 (the "cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplication with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);
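To convince yourself that this matches the cityblock case of the question's function, here is a quick check (my addition, reusing the question's test data):
A = rand(10,3); B = rand(80,3); wts = [0.1 0.5 0.4];
D_ref = pairdist1(A, B, wts, 'cityblock');
absm = abs(bsxfun(@minus, permute(A,[1 3 2]), permute(B,[3 1 2])));
D = reshape(reshape(absm,[],size(A,2))*wts(:), size(A,1), []);
assert(max(abs(D(:) - D_ref(:))) < 1e-12);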

Related

a faster way to compute the error of a vector

For a given vector $(x_1,x_2,\ldots, x_n)$ I am trying to compute $\sum_{l=1}^{n}\sum_{k=1}^{n}\lVert x_l - x_k\rVert$.
I wrote the following code
for l = 1:n
    for k = 1:n
        error = error + norm(x(l)-x(k));
    end
end
This code is not fast, especially when $n$ is large. I am aware that I am double counting actually... But how may I avoid it? How can I speed up my code?
Thank you!
You can do it with bsxfun, which is fast:
d = abs(bsxfun(@minus, x, x.'));
result = sum(d(:));
Or alternatively use pdist with 'cityblock' distance (which for one-dimensional observations reduces to absolute difference). This computes each distance once, so you need to multiply the sum by 2:
result = 2*sum(pdist(x(:),'cityblock'));
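A small check (my addition) that the two approaches agree:
x = rand(5,1);
d = abs(bsxfun(@minus, x, x.'));
assert(abs(sum(d(:)) - 2*sum(pdist(x,'cityblock'))) < 1e-12);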
How about a simple speed up?
for a = 1:n
    for b = a+1:n
        error = error + 2*norm(x(a)-x(b));
    end
end
For a scalar, norm just gives abs.
So,
error = sum(sum(abs(bsxfun(@minus, x, x.'))));
will do the same thing.
also check out pdist which will do this for vectors, using vector norms, in an even faster way.

Conditional Sum in Array

I have 2 arrays, A and B. I want to form a new array C with the same dimensions as B, where each element C(n) is the sum of the elements of A that are greater than B(n).
Below is my working code
A = [1:1:1000];
B = [1:1:100];
for n = 1:numel(B)
    C(n) = sum(A(A>B(n)));
end
However, when A has millions of rows and B has thousands, and I have to do similar calculations for 20 array pairs, it takes an insane amount of time.
Is there any faster way?
For example, histcounts is pretty fast, but it counts, rather than summing.
Thanks
Depending on the size of your arrays (and your memory limitations), the following code might be slightly faster:
C = A*bsxfun(@gt,A',B);
Though it's vectorized, it seems to be bottlenecked (perhaps) by the allocation of memory. I'm looking to see if I can get a further speedup. Depending on your input vector size, I've seen up to a factor of 2 speedup for large vectors.
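For concreteness, with the question's own setup this reads:
A = 1:1000;                  % row vector, as in the question
B = 1:100;
C = A*bsxfun(@gt, A', B);    % C(n) = sum of elements of A strictly greater than B(n)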
Here's a method that is a bit quicker, but I'm sure there is a better way to solve this problem.
a = sort(A); %// If A and B are already sorted then this isn't necessary!
b = sort(B);
c(numel(B)) = 0; %// Initialise c
s = cumsum(a,2,'reverse'); %// Get the partial sums of a
for n = 1:numel(B)
    %// Pull out the sum for elements in a larger than b(n)
    c(n) = s(find(a>b(n),1,'first'));
end
According to some very rough tests, this seems to run a bit better than twice as fast as the original method.
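One edge case worth guarding (my note, not in the original answer): if no element of a exceeds b(n), find returns empty and the indexing errors out. A hedged variant of the loop:
for n = 1:numel(B)
    k = find(a > b(n), 1, 'first');
    if ~isempty(k)
        c(n) = s(k);   % partial sum of all elements of a greater than b(n)
    end                % otherwise c(n) keeps its initialised value of 0
end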
You had the right idea with histcounts, as you are basically "accumulating" certain A elements based on binning. This binning operation could be done with histc. Listed in this post is a solution that starts off with steps similar to those in @David's answer and then uses histc to bin and sum up selective elements from A to get the desired output, all of it in a vectorized manner. Here's the implementation -
%// Sort A and B and also get sorted B indices
sA = sort(A);
[sB,sortedB_idx] = sort(B);
[~,bin] = histc(sB,sA); %// Bin sorted B onto sorted A
C_out = zeros(1,numel(B)); %// Setup output array
%// Take care of the case when all elements of A are greater than all elements of B
if sA(1) > sB(end)
    C_out(:) = sum(A);
end
%// Only do further processing if at least one element of B falls within A's range
if any(bin)
    csA = cumsum(sA,'reverse'); %// Reverse cumsum on sorted A
    %// Get sum(A(A>B(n))) for every n, but for the sorted versions
    valid_mask = cummax(bin) - bin == 0;
    valid_mask2 = bin(valid_mask)+1 <= numel(A);
    valid_mask(1:numel(valid_mask2)) = valid_mask2;
    C_out(valid_mask) = csA(bin(valid_mask)+1);
    %// Rearrange C_out to get back the original unsorted order
    [~,idx] = sort(sortedB_idx);
    C_out = C_out(idx);
end
Also, when comparing the result from this method with the one from the original for-loop version, remember that there will be slight variations in output: this vectorized solution uses cumsum, which computes a running summation, and as such adds large cumulatively summed numbers to individual elements that are comparatively very small, whereas the for-loop version sums only selected elements. So floating-point precision issues come up there.

How to improve the execution time of this function?

Suppose that f(x,y) is a bivariate function as follows:
function [ f ] = f(x,y)
UN = @(g) 1.6*(1-acos(g)/pi)-0.8;
f = 1+UN(cos(0.5*pi*x+y));
end
How can I improve the execution time of the function F(N) with the following code:
function [VAL] = F(N)
x = 0:4/N:4;
y = 0:2*pi/1000:2*pi;
VAL = zeros(N+1,3);
for i = 1:N+1
    val = zeros(1,N+1);
    for j = 1:N+1
        val(j) = trapz(y,f(0,y).*f(x(i),y).*f(x(j),y))/2/pi;
    end
    val = fftshift(fft(val))/N;
    l = (length(val)+1)/2;
    VAL(i,:) = val(l-1:l+1);
end
VAL = fftshift(fft(VAL,[],1),1)/N;
L = (size(VAL,1)+1)/2;
VAL = VAL(L-1:L+1,:);
end
Note that N=2^p where p>10, so please consider the memory limitations while optimizing the code using ndgrid, arrayfun, etc.
FYI: The code intends to find the central 3-by-3 submatrix of the fftn of
fun = @(a,b) trapz(y,f(0,y).*f(a,y).*f(b,y))/2/pi;
where a,b are in [0,4]. The key idea is that we can save memory using the code above, especially when N is very large. But the execution time is still an issue because of the nested loops. See the figure below for N=2^2:
This is not a full answer, but some possibly helpful hints:
0) The trivial: Are you sure you need numerics? Can't you do the computation analytically?
1) Do not use function handles:
function [ f ] = f(x,y)
f = 1+1.6*(1-acos(cos(0.5*pi*x+y))/pi)-0.8;
end
2) Simplify analytically: acos(cos(x)) is the same as abs(mod(x + pi, 2 * pi) - pi), which should compute slightly faster. Or, instead of sampling and then numerically integrating, first integrate analytically and sample the result.
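The acos/mod identity above is easy to sanity-check numerically (a throwaway test of my own):
x = linspace(-10, 10, 1e5);
err = max(abs(acos(cos(x)) - abs(mod(x + pi, 2*pi) - pi)));
% err should be tiny; any difference is floating-point round-off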
3) The FFT is a very efficient algorithm to compute the full DFT, but you don't need the full DFT. Since you only want the central 3 x 3 coefficients, it might be more efficient to directly apply the DFT definition and evaluate the formula only for those coefficients that you want. That should be both fast and memory-efficient.
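As a concrete illustration of this hint, here is a minimal sketch (my own, assuming val is the length-(N+1) row vector from the inner loop of F) that evaluates only the three central coefficients directly instead of calling fft and discarding the rest:
L = numel(val);    % L = N+1 is odd, so the DC bin is the middle element after fftshift
n = 0:L-1;
coeffs = zeros(1,3);
for k = -1:1       % the three central frequencies
    coeffs(k+2) = sum(val .* exp(-2i*pi*k*n/L)) / N;
end
% coeffs should match val(l-1:l+1) in the original code, where l = (L+1)/2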
4) If you repeatedly do this computation, it might be helpful to precompute DFT coefficients. Here, dftmtx from the Signal Processing toolbox can assist.
5) To get rid of the loops, think about the problem not in the form of computation instructions, but a single matrix operation. If you consider your input N x N matrix as a vector with N² elements, and your output 3 x 3 matrix as a 9-element vector, then the whole operation you apply (numerical integration via trapz and DFT via fft) appears to be a simple linear transform, which it should be possible to express as an N² x 9 matrix.

What could cause liblinear to reach the maximal number of iterations?

I use liblinear with my program to perform multi-class classification with the L2R_L2LOSS_SVC_DUAL solver. In the current test-setup I have 1600 instances from a total of 9 classes with 1000 features each.
I'm trying to determine the optimal C parameter for training with 5-fold cross-validation, but even with a small C of 1.0 liblinear reaches the maximal number of iterations:
................................................................................
....................
optimization finished, #iter = 1000
WARNING: reaching max number of iterations
Using -s 2 may be faster (also see FAQ)
Objective value = -637.100923
nSV = 783
The FAQ site mentions the following possible reasons for this:
Data isn't scaled.
A large C parameter is used.
A lot of instances with a small number of features are used, so that the solver L2R_L2LOSS_SVC may be faster.
None of these applies to my case. Since my feature vector is some kind of histogram, there is a natural maximum that I use to scale the features to [0,1].
I set up the parameters for liblinear as follows:
struct parameter svmParams;
svmParams.solver_type = L2R_L2LOSS_SVC_DUAL;
svmParams.eps = 0.1;
svmParams.nr_weight = 0;
svmParams.weight_label = NULL;
svmParams.weight = NULL;
svmParams.p = 0.1;
svmParams.C = 1.0;
My question is: What other reasons, not mentioned in the FAQ, may cause liblinear to operate slow in this scenario and what may I do against it?

Is indexing vectors in MATLAB inefficient?

Background
My question is motivated by simple observations, which somewhat undermine the beliefs/assumptions often held/made by experienced MATLAB users:
MATLAB is very well optimized when it comes to the built-in functions and the fundamental language features, such as indexing vectors and matrices.
Loops in MATLAB are slow (despite the JIT) and should generally be avoided if the algorithm can be expressed in a native, 'vectorized' manner.
The bottom line: core MATLAB functionality is efficient and trying to outperform it using MATLAB code is hard, if not impossible.
Investigating performance of vector indexing
The example codes shown below are as fundamental as it gets: I assign a scalar value to all vector entries. First, I allocate a vector x of zeros:
tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.
Having x I would like to set all its entries to the same value. In practice you would do it differently, e.g., x = value*ones(1e8,1), but the point here is to investigate the performance of vector indexing. The simplest way is to write:
tic; x(:) = 1; toc
Elapsed time is 0.094316 seconds.
Let's call it method 1 (the methods are numbered by the value assigned to x). It seems to be very fast (faster at least than memory allocation). Because the only thing I do here is operate on memory, I can estimate the efficiency of this code by calculating the obtained effective memory bandwidth and comparing it to the hardware memory bandwidth of my computer:
eff_bandwidth = numel(x) * 8 bytes per double * 2 / time
In the above, I multiply by 2 because unless SSE streaming is used, setting values in memory requires that the vector is both read from and written to the memory. In the above example:
eff_bandwidth(1) = 1e8*8*2/0.094316 = 17 GB/s
STREAM-benchmarked memory bandwidth of my computer is around 17.9 GB/s, so indeed - MATLAB delivers close-to-peak performance in this case! So far, so good.
Method 1 is suitable if you want to set all vector elements to some value. But if you only want to access every step-th element, you need to substitute the : with e.g., 1:step:end. Below is a direct speed comparison with method 1:
tic; x(1:end) = 2; toc
Elapsed time is 0.496476 seconds.
While you would not expect it to perform any different, method 2 is clearly big trouble: factor 5 slowdown for no reason. My suspicion is that in this case MATLAB explicitly allocates the index vector (1:end). This is somewhat confirmed by using explicit vector size instead of end:
tic; x(1:1e8) = 3; toc
Elapsed time is 0.482083 seconds.
Methods 2 and 3 perform equally bad.
Another possibility is to explicitly create an index vector id and use it to index x. This gives you the most flexible indexing capabilities. In our case:
tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc
Elapsed time is 1.208419 seconds.
Now that is really something - 12 times slowdown compared to method 1! I understand it should perform worse than method 1 because of the additional memory used for id, but why is it so much worse than methods 2 and 3?
Let's give loops a try - as hopeless as it may sound.
tic;
for i=1:numel(x)
    x(i) = 5;
end
toc
Elapsed time is 0.788944 seconds.
A big surprise - a loop beats a vectorized method 4, but is still slower than methods 1, 2 and 3. It turns out that in this particular case you can do better:
tic;
for i=1:1e8
    x(i) = 6;
end
toc
Elapsed time is 0.321246 seconds.
And that is probably the most bizarre outcome of this study - a MATLAB-written loop significantly outperforms native vector indexing. That should certainly not be so. Note that the JIT'ed loop is still 3 times slower than the theoretical peak almost attained by method 1. So there is still plenty of room for improvement. It is just surprising (a stronger word would be more suitable) that the usual 'vectorized' indexing (1:end) is even slower.
Questions
is simple indexing in MATLAB very inefficient (methods 2, 3, and 4 are slower than method 1), or did I miss something?
why is method 4 (so much) slower than methods 2 and 3?
why does using 1e8 instead of numel(x) as a loop bound speed up the code by factor 2?
Edit
After reading Jonas's comment, here is another way to do that using logical indices:
tic;
id = logical(ones(1, 1e8));
x(id) = 7;
toc
Elapsed time is 0.613363 seconds.
Much better than method 4.
For convenience:
function test
tic; x = zeros(1,1e8); toc
tic; x(:) = 1; toc
tic; x(1:end) = 2; toc
tic; x(1:1e8) = 3; toc
tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc
tic;
for i=1:numel(x)
    x(i) = 5;
end
toc
tic;
for i=1:1e8
    x(i) = 6;
end
toc
end
I can, of course, only speculate. However, when I run your test with the JIT compiler enabled vs disabled, I get the following results:
%   with JIT   no JIT
    0.1677     0.0011     %# init
    0.0974     0.0936     %# #1 I added an assignment before this line to avoid issues with deferring
    0.4005     0.4028     %# #2
    0.4047     0.4005     %# #3
    1.1160     1.1180     %# #4
    0.8221     48.3239    %# #5 This is where "don't use loops in Matlab" comes from
    0.3232     48.2197    %# #6
    0.5464                %# logical indexing
Dividing shows us where there is any speed increase:
%   withoutJit./withJit
    0.0067     %# w/o JIT, the memory allocation is deferred
    0.9614     %# no JIT
    1.0057     %# no JIT
    0.9897     %# no JIT
    1.0018     %# no JIT
    58.7792    %# numel
    149.2010   %# no numel
The apparent speed-up on initialization happens because with the JIT turned off, MATLAB appears to delay the memory allocation until it is used, so x = zeros(...) does not really do anything. (Thanks, @angainor.)
Methods 1 through 4 don't seem to benefit from the JIT. I guess that #4 could be slow due to additional input testing in subsasgn to make sure that the input is of the proper form.
The numel result could have something to do with it being harder for the compiler to deal with an uncertain number of iterations, or with some overhead due to checking whether the bound of the loop is ok (though no-JIT tests suggest only ~0.1s for that).
Surprisingly, on R2012b on my machine, logical indexing seems to be slower than #4.
I think that this goes to show, once again, that MathWorks have done great work in speeding up code, and that "don't use loops" isn't always best if you're trying to get the fastest execution time (at least at the moment). Nevertheless, I find that vectorizing is in general a good approach, since (a) the JIT fails on more complex loops, and (b) learning to vectorize makes you understand Matlab a lot better.
Conclusion: If you want speed, use the profiler, and re-profile if you switch Matlab versions.
As pointed out by @Adriaan in the comments, nowadays it may be better to use timeit() to measure execution speed.
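For example, a minimal timeit pattern might look like this (my sketch, not from the original answer; timeit takes a zero-argument function handle, so allocation and assignment are wrapped and timed together):
function demoTimeit()
% compare two of the assignment methods with timeit
t1 = timeit(@assignColon);
t2 = timeit(@assignRange);
fprintf('x(:) = 1     : %.3f s (incl. allocation)\n', t1);
fprintf('x(1:end) = 1 : %.3f s (incl. allocation)\n', t2);
end
function x = assignColon()
x = zeros(1,1e8);
x(:) = 1;
end
function x = assignRange()
x = zeros(1,1e8);
x(1:end) = 1;
end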
For reference, I used the following slightly modified test function
function tt = speedTest
tt = zeros(8,1);
tic; x = zeros(1,1e8); tt(1) = toc;
x(:) = 2;
tic; x(:) = 1; tt(2) = toc;
tic; x(1:end) = 2; tt(3) = toc;
tic; x(1:1e8) = 3; tt(4) = toc;
tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
tt(5) = toc;
tic;
for i = 1:numel(x)
    x(i) = 5;
end
tt(6) = toc;
tic;
for i = 1:1e8
    x(i) = 6;
end
tt(7) = toc;
%# logical indexing
tic;
id = true(1,1e8);
x(id) = 7;
tt(8) = toc;
end
I do not have an answer to all the problems, but I do have some refined speculations on methods 2, 3 and 4.
Regarding methods 2 and 3. It does indeed seem that MATLAB allocates memory for the vector indices and fills it with values from 1 to 1e8. To understand it, let's see what is going on. By default, MATLAB uses double as its data type. Allocating the index array takes the same time as allocating x
tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.
For now, the index array contains only zeros. Assigning values to the x vector in an optimal way, as in method 1, takes 0.094316 seconds. Now, the index vector has to be read from memory so that it can be used in indexing. That is an additional 0.094316/2 seconds. Recall that in x(:)=1 the vector x has to be both read from and written to memory, so reading it alone takes half the time. Assuming this is all that is done in x(1:end)=value, the total time of methods 2 and 3 should be
t = 0.260525+0.094316+0.094316/2 = 0.402
It is almost correct, but not quite. I can only speculate, but filling the index vector with values is probably done as an additional step and takes an additional 0.094316 seconds. Hence, t = 0.4963, which more or less fits the time of methods 2 and 3.
These are only speculations, but they do seem to confirm that MATLAB explicitly creates index vectors when doing native vector indexing. Personally, I consider this to be a performance bug. MATLAB's JIT compiler should be smart enough to understand this trivial construct and convert it to a call to a correct internal function. As it is now, on today's memory-bandwidth-bound architectures indexing performs at around 20% of the theoretical peak.
So if you do care about performance, you will have to implement x(1:step:end) as a MEX function, something like
set_value(x, 1, step, 1e8, value);
Now this is clearly illegal in MATLAB, since you are NOT ALLOWED to modify arrays in MEX files in place.
Edit: Regarding method 4, one can try to analyze the performance of the individual steps as follows:
tic;
id = 1:1e8; % colon(1,1e8);
toc
tic
x(id) = 4;
toc
Elapsed time is 0.475243 seconds.
Elapsed time is 0.763450 seconds.
The first step, allocating and filling the values of the index vector, takes the same time as methods 2 and 3 alone. It seems that it is way too much - it should take at most the time needed to allocate the memory and set the values (0.260525s+0.094316s = 0.3548s), so there is an additional overhead of 0.12 seconds somewhere, which I can not understand. The second part (x(id) = 4) also looks very inefficient: it should take the time needed to set the values of x and to read the id vector (0.094316s+0.094316/2s = 0.1415s) plus some error checks on the id values. Programmed in C, the two steps take:
create id 0.214259
x(id) = 4 0.219768
The code used checks that a double index in fact represents an integer, and that it fits the size of x:
/* tic()/toc() here are the author's own timing helpers, not standard C */
tic();
id = malloc(sizeof(double)*n);
for(i=0; i<n; i++) id[i] = i;
toc("create id");
tic();
for(i=0; i<n; i++) {
    long iid = (long)id[i];
    if(iid>=0 && iid<n && (double)iid==id[i]){
        x[iid] = 4;
    } else break;
}
toc("x(id) = 4");
The second step takes longer than the expected 0.1415s - that is due to the necessity of error checks on the id values. The overhead seems too large to me - maybe it could be written better. Still, the time required is 0.4340s, not 1.208419s. What MATLAB does under the hood - I have no idea. Maybe it is necessary to do it, I just don't see it.
Of course, using doubles as indices introduces two additional levels of overhead:
a double is twice the size of a uint32 - recall that memory bandwidth is the limiting factor here
doubles need to be cast to integers for indexing
Method 4 can be written in MATLAB using integer indices:
tic;
id = uint32(1):1e8;
toc
tic
x(id) = 8;
toc
Elapsed time is 0.327704 seconds.
Elapsed time is 0.561121 seconds.
This clearly improves the performance by 30% and proves that one should use integers as vector indices. However, the overhead is still there.
As I see it now, we can not do anything to improve the situation working within the MATLAB framework, and we have to wait until MathWorks fixes these issues.
Just a quick note to show how in 8 years of development, the performance characteristics of MATLAB have changed a lot.
This is on R2017a (5 years after OP's post):
Elapsed time is 0.000079 seconds. % x = zeros(1,1e8);
Elapsed time is 0.101134 seconds. % x(:) = 1;
Elapsed time is 0.578200 seconds. % x(1:end) = 2;
Elapsed time is 0.569791 seconds. % x(1:1e8) = 3;
Elapsed time is 1.602526 seconds. % id = 1:1e8; x(id) = 4;
Elapsed time is 0.373966 seconds. % for i=1:numel(x), x(i) = 5; end
Elapsed time is 0.374775 seconds. % for i=1:1e8, x(i) = 6; end
Note how the loop over 1:numel(x) is faster than indexing x(1:end); it seems that the array 1:end is still being created, whereas for the loop it is not. It is now better in MATLAB not to vectorize!
(I did add an assignment x(:)=0 after allocating the matrix, outside of any timed regions, to actually have the memory allocated, since zeros only reserves the memory.)
On MATLAB R2020b (online) (3 years later) I see these times:
Elapsed time is 0.000073 seconds. % x = zeros(1,1e8);
Elapsed time is 0.084847 seconds. % x(:) = 1;
Elapsed time is 0.084643 seconds. % x(1:end) = 2;
Elapsed time is 0.085319 seconds. % x(1:1e8) = 3;
Elapsed time is 1.393964 seconds. % id = 1:1e8; x(id) = 4;
Elapsed time is 0.168394 seconds. % for i=1:numel(x), x(i) = 5; end
Elapsed time is 0.169830 seconds. % for i=1:1e8, x(i) = 6; end
x(1:end) is now optimized in the same way as x(:); the vector 1:end is no longer being explicitly created.
