Matlab: Sum corresponding values if index is within a range - arrays

I have been going crazy trying to figure out a way to speed this up. Right now my current code takes ~200 sec looping over 77000 events. I was hoping someone might be able to help me speed this up because I have to do about 500 of these.
Problem:
I have arrays (both 200000x1) that correspond to the Energy and Position of a hit over 77000 events. I have the range of each event separated into two arrays, event_start and event_end. First thing I do is look for positions in a specific range, then I put the corresponding energies in their own array. To get what I need out of this information, I loop through each event and its corresponding start/end to sum up all the energies from each hit. My code is below:
indx_pos = find(pos>0.7 & pos<2.0);
energy = HitEnergy(indx_pos);
for i=1:n_events
    Etotal(i) = sum(energy(find(indx_pos>=event_start(i) ...
        & indx_pos<=event_end(i))));
end
Sample input & output:
% Sample input
% pos and energy same length
n_events = 3;
event_start = [1 3 7]';
event_end = [2 6 8]';
pos = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';
% Sample Output
Etotal = 0.0060
0.0800
0

Approach #1: Generic case
One approach with bsxfun and matrix-multiplication -
mask = bsxfun(@ge,indx_pos,event_start.') & bsxfun(@le,indx_pos,event_end.')
Etotal = energy.'*mask
This could be a bit memory-hungry if indx_pos has lots of elements in it.
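For instance, running this on the sample input from the question gives a quick sanity check:
indx_pos = find(pos>0.7 & pos<2.0);  %// -> [1; 2; 5]
energy = HitEnergy(indx_pos);        %// -> [0.002; 0.004; 0.08]
mask = bsxfun(@ge,indx_pos,event_start.') & bsxfun(@le,indx_pos,event_end.');
Etotal = energy.'*mask               %// -> [0.0060  0.0800  0]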
Approach #2: Non-overlapping start/end ranges case
One can use accumarray for this special case like so -
%// Setup ID array for use in accumarray later on
loc(numel(pos))=0; %// Fast pre-allocation scheme
valids = event_end+1<=numel(pos);
loc(event_end(valids)+1) = -1*(1:sum(valids));
loc(event_start) = loc(event_start)+(1:numel(event_end));
id = cumsum(loc);
%// Set elements as zeros in HitEnergy that do not satisfy the criteria:
%// pos>0.7 & pos<2.0
HitEnergy_select = (pos>0.7 & pos<2.0).*HitEnergy(:);
%// Discard elements in HitEnergy_select & id that have IDs as zeros
HitEnergy_select = HitEnergy_select(id~=0);
id = id(id~=0);
%// Accumulate summations as done inside the loop in the original code
Etotal = accumarray(id(:),HitEnergy_select);
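To see how the ID array works, here is a hand-worked trace on the sample input (a sanity check, not extra code to run):
%// With the sample pos, event_start = [1 3 7]' and event_end = [2 6 8]':
%// loc = [1 0 1 0 0 0 1 0]  ->  id = cumsum(loc) = [1 1 2 2 2 2 3 3]
%// HitEnergy_select = [0.002 0.004 0 0 0.08 0 0 0]  (the pos criteria zero out the rest)
%// accumarray(id(:),HitEnergy_select)  ->  [0.0060; 0.0800; 0]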

The problem is that for every event you are searching the entire vector indx_pos.
Constrain your search inside the loop to only the range from event_start(i) to event_end(i):
for i = 1:n_events
I = event_start(i):event_end(i);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal(i) = sum(HitEnergy(I(posIIsWithinRange)));
end
You could also use a vectorized version based on run length decoding and vectorizing the notion of colon. (Download the functions coloncatrld and runLengthDecode.)
I = coloncatrld(event_start, event_end);
energy = HitEnergy(I);
eventNum = runLengthDecode(event_end - event_start+1);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal = accumarray(eventNum(posIIsWithinRange), energy(posIIsWithinRange), [n_events,1]);
This is similar to Divakar's Approach #2 with the addition that it should work for overlapping ranges too.
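If you prefer not to download the two helpers, minimal stand-ins are easy to sketch (an illustration only, assuming every event is non-empty and MATLAB R2015a+ for repelem; the downloadable versions are more general):
lens = event_end - event_start + 1;
%// runLengthDecode(lens): repeat event number k lens(k) times
eventNum = repelem((1:numel(lens)).', lens);
%// coloncatrld(event_start, event_end): concatenate the ranges event_start(k):event_end(k)
shifts = ones(sum(lens),1);
shifts(cumsum(lens(1:end-1))+1) = event_start(2:end) - event_end(1:end-1);
shifts(1) = event_start(1);
I = cumsum(shifts);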

Related

Random sampling of elements from an array based on a target condition

I have an array (let's call it ElmInfo) of size Nx2 representing a geometry. In that array the element number and element volume are in column 1 and column 2 respectively. The element volumes vary widely. The sum of the volume of all elements leads to a value V which can be obtained in MATLAB as:
V=sum(ElmInfo(:,2));
I want to randomly sample elements from the array ElmInfo in such a way that the volume of the sampled elements (with no repetition) adds up to a target volume V1. Note: V1 is less than V, so I don't know in advance the number of elements to be sampled. For example, in one sampling run the number of sampled elements might be 10, whereas in another it might be 15.
There is no straightforward MATLAB in-built function to meet the target condition. How can I implement the code in MATLAB?
Finally I got the answer to my question. Here is the solution I got from a contributor at MATLAB Central. For the convenience of the Stack Overflow community I am posting the answer here.
TotVol=sum(ElmInfo(:,2));
DefVf = 1.5; % This is the volume fraction I want to sample, in percent
% Target sample volume
DefVolm_target = TotVol*(DefVf/100);
% **************************************
n = 300; % number of candidate elements (should match numel(v))
v = ElmInfo(:,2);
tol = 1e-6;
sample = [];
maxits = 10000;
for count = 1:maxits
    p = randperm(n);
    s = cumsum(v(p));
    k = find(abs(s - DefVolm_target) < tol);
    if ~isempty(k)
        sample_indices = p(1:k(1));
        sample = v(sample_indices);
        fprintf('Sample found after %d iterations\n', count);
        break
    end
end
DefVol_sim=sum(sample);
sampled_Elm=sort(sample_indices);

Compute the product of the next n elements in array

I would like to compute the product of the next n adjacent elements of a matrix. The number n of elements to be multiplied should be given in the function's input.
For example for this input I should compute the product of every 3 consecutive elements, starting from the first.
[p, ind] = max_product([1 2 2 1 3 1],3);
This gives [1*2*2, 2*2*1, 2*1*3, 1*3*1] = [4,4,6,3].
Is there any practical way to do it? Now I do this using:
for ii = 1:(length(v)-2)
p = prod(v(ii:ii+n-1));
end
where v is the input vector and n is the number of elements to be multiplied.
In this example n=3, but it can take any positive integer value.
Depending on whether n is odd or even, or length(v) is odd or even, I sometimes get right answers but sometimes an error.
For example for arguments:
v = [1.35912281237829 -0.958120385352704 -0.553335935098461 1.44601450110386 1.43760259196739 0.0266423803393867 0.417039432979809 1.14033971399183 -0.418125096873537 -1.99362640306847 -0.589833539347417 -0.218969651537063 1.49863539349242 0.338844452879616 1.34169199365703 0.181185490389383 0.102817336496793 0.104835620599133 -2.70026800170358 1.46129128974515 0.64413523430416 0.921962619821458 0.568712984110933]
n = 7
I get the error:
Index exceeds matrix dimensions.
Error in max_product (line 6)
p = prod(v(ii:ii+n-1));
Is there any correct general way to do it?
Based on the solution in Fast numpy rolling_product, I'd like to suggest a MATLAB version of it, which leverages the movsum function introduced in R2016a.
The mathematical reasoning is that a product of numbers equals the exponential of the sum of their logarithms: x1*x2*...*xk = exp(log(x1) + log(x2) + ... + log(xk)).
A possible MATLAB implementation of the above may look like this:
function P = movprod(vec,window_sz)
    P = exp(movsum(log(vec),[0 window_sz-1],'Endpoints','discard'));
    if isreal(vec) % Ensures correct outputs when the input contains
        P = real(P); % negative and/or complex entries.
    end
end
Several notes:
I haven't benchmarked this solution, and do not know how it compares in terms of performance to the other suggestions.
It should work correctly with vectors containing zero and/or negative and/or complex elements.
It can be easily expanded to accept a dimension to operate along (for array inputs), and any other customization afforded by movsum.
The 1st input is assumed to be either a double or a complex double row vector.
Outputs may require rounding.
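A quick usage check against the sample in the question (results agree up to floating-point round-off):
P = movprod([1 2 2 1 3 1], 3)   % -> [4 4 6 3]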
Update
Inspired by the nicely thought-out answer of Dev-iL, here is a handy solution that does not require Matlab R2016a or above:
out = real( exp(conv(log(a),ones(1,n),'valid')) )
The basic idea is to transform the multiplication into a sum so that a moving sum can be used, which in turn can be realised by convolution.
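For example, on the question's sample input (which contains no zeros, so taking the log is safe):
a = [1 2 2 1 3 1]; n = 3;
out = real( exp(conv(log(a),ones(1,n),'valid')) )   % -> [4 4 6 3]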
Old answers
This is one way using gallery to get a circulant matrix and indexing the relevant part of the resulting matrix before multiplying the elements:
a = [1 2 2 1 3 1]
n = 3
%// circulant matrix
tmp = gallery('circul', a(:))
%// product of relevant parts of matrix
out = prod(tmp(end-n+1:-1:1, end-n+1:end), 2)
out =
4
4
6
3
More memory efficient alternative in case there are no zeros in the input:
a = [10 9 8 7 6 5 4 3 2 1]
n = 2
%// cumulative product
x = [1 cumprod(a)]
%// shifted by n and divided by itself
y = circshift( x,[0 -n] )./x
%// remove last elements
out = y(1:end-n)
out =
90 72 56 42 30 20 12 6 2
Your approach is correct. You should just change the for loop to for ii = 1:(length(v)-n+1) and then it will work fine.
If you are not going to deal with large inputs, another approach is using gallery as explained in @thewaywewalk's answer.
I think the problem may be based on your indexing. The line that states for ii = 1:(length(v)-2) does not provide the correct range of ii.
Try this:
function out = max_product(in,n)
    n = n-1; % this is because we add n to i later ("size" renamed to n to avoid shadowing the built-in)
    out = zeros(length(in)-n,1); % one product per window, as a column vector
    for i = 1:length(in)-n
        out(i) = prod(in(i:i+n));
    end
end
Your code works when restated like so:
for ii = 1:(length(v)-(n-1))
    p(ii) = prod(v(ii:ii+(n-1))); % index p so each window's product is kept
end
That should take care of the indexing problem.
Using bsxfun you create a matrix in which each row contains n consecutive elements, then take prod along the 2nd dimension. I think this is the most efficient way:
max_product = @(v, n) prod(v(bsxfun(@plus, (1 : n), (0 : numel(v)-n)')), 2);
p = max_product([1 2 2 1 3 1],3)
Update:
Some other solutions have been updated, and some, such as @Dev-iL's answer, outperform the others. I can also suggest fftconv, which in Octave outperforms conv.
If you can upgrade to R2017a, you can use the new movprod function to compute a windowed product.

What is the fastest way to count elements in an array?

In my models, one of the most repeated tasks is counting the number of each element within an array. The counting is from a closed set, so I know there are X types of elements, and all or some of them populate the array, along with zeros that represent 'empty' cells. The array is not sorted in any way, could be quite long (about 1M elements), and this task is done thousands of times during one simulation (which is also part of hundreds of simulations). The result should be a vector r of size X, so r(k) is the count of k in the array.
Example:
For X = 9, if I have the following input vector:
v = [0 7 8 3 0 4 4 5 3 4 4 8 3 0 6 8 5 5 0 3]
I would like to get this result:
r = [0 0 4 4 3 1 1 3 0]
Note that I don't want the count of zeros, and that elements that don't appear in the array (like 2) have a 0 in the corresponding position of the result vector (r(2) == 0).
What would be the fastest way to achieve this goal?
tl;dr: The fastest method depends on the size of the array. For arrays smaller than 2^14, method 3 below (accumarray) is faster. For arrays larger than that, method 2 below (histcounts) is better.
UPDATE: I tested this also with implicit broadcasting, which was introduced in R2016b, and the results are almost identical to the bsxfun approach, with no significant difference relative to the other methods.
Let's see what the available methods are for performing this task. For the following examples we will assume X has n elements, from 1 to n, and our array of interest is M, which is a column array that can vary in size. Our result vector will be spp¹, such that spp(k) is the number of ks in M. Although I write here about X, there is no explicit implementation of it in the code below; I just define n = 500, and X is implicitly 1:500.
The naive for loop
The simplest and most straightforward way to cope with this task is a for loop that iterates over the elements in X and counts the number of elements in M equal to each:
function spp = loop(M,n)
spp = zeros(n,1);
for k = 1:size(spp,1);
spp(k) = sum(M==k);
end
end
This is of course not so smart, especially if only a small group of elements from X populates M, so we better look first for those that are already in M:
function spp = uloop(M,n)
u = unique(M); % finds which elements to count
spp = zeros(n,1);
for k = u(u>0).';
spp(k) = sum(M==k);
end
end
Usually, in MATLAB, it is advisable to take advantage of built-in functions as much as possible, since most of the time they are much faster. I thought of 5 options to do so:
1. The function tabulate
The function tabulate returns a very convenient frequency table that at first sight seems to be the perfect solution for this task:
function tab = tabi(M)
tab = tabulate(M);
if tab(1)==0
tab(1,:) = [];
end
end
The only fix to be done is to remove the first row of the table if it counts the 0 element (it could be that there are no zeros in M).
2. The function histcounts
Another option that can be tweaked quite easily to our needs is histcounts:
function spp = histci(M,n)
spp = histcounts(M,1:n+1);
end
here, in order to count all the different elements between 1 and n separately, we define the edges to be 1:n+1, so every element in X has its own bin. We could also write histcounts(M(M>0),'BinMethod','integers'), but I already tested it, and it takes more time (though it makes the function independent of n).
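As a quick check against the example in the question (here n = 9; the zeros fall below the first edge and are simply not counted):
v = [0 7 8 3 0 4 4 5 3 4 4 8 3 0 6 8 5 5 0 3];
r = histcounts(v, 1:10)   % -> [0 0 4 4 3 1 1 3 0]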
3. The function accumarray
The next option I'll bring here is the use of the function accumarray:
function spp = accumi(M)
spp = accumarray(M(M>0),1);
end
here we give the function M(M>0) as input, to skip the zeros, and use 1 as the vals input to count all unique elements.
4. The function bsxfun
We can even use the binary operation @eq (i.e. ==) to look for all elements of each type:
function spp = bsxi(M,n)
spp = bsxfun(@eq,M,1:n);
spp = sum(spp,1);
end
if we keep the first input M and the second 1:n in different dimensions, so one is a column vector and the other is a row vector, then the function compares each element in M with each element in 1:n, and creates a length(M)-by-n logical matrix that we can sum to get the desired result.
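As mentioned in the update above, from R2016b the same comparison can be written with implicit expansion instead of bsxfun; a minimal sketch:
function spp = impi(M,n)
spp = sum(M == 1:n, 1);
end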
5. The function ndgrid
Another option, similar to the bsxfun one, is to explicitly create the two matrices of all possibilities using the ndgrid function:
function spp = gridi(M,n)
[Mx,nx] = ndgrid(M,1:n);
spp = sum(Mx==nx);
end
then we compare them and sum over columns, to get the final result.
Benchmarking
I have done a little test to find the fastest method among all those mentioned above; I defined n = 500 for all trials. For some methods (especially the naive for) n has a great impact on execution time, but this is not the issue here since we want to test for a given n.
Here are the results:
We can notice several things:
Interestingly, there is a shift in the fastest method. For arrays smaller than 2^14 accumarray is the fastest. For arrays larger than 2^14 histcounts is the fastest.
As expected, the naive for loops, in both versions, are the slowest, but for arrays smaller than 2^8 the "unique & for" option is the slower of the two. ndgrid becomes the slowest for arrays bigger than 2^11, probably because of the need to store very large matrices in memory.
There is some irregularity in the way tabulate works on arrays smaller than 2^9. This result was consistent (with some variation in the pattern) in all the trials I conducted.
(the bsxfun and ndgrid curves are truncated because higher values make my computer get stuck, and the trend is already quite clear)
Also, notice that the y-axis is log10, so a decrease of one unit (as for arrays of size 2^19, between accumarray and histcounts) means a 10-times faster operation.
I'll be glad to hear in the comments for improvements to this test, and if you have another, conceptually different method, you are most welcome to suggest it as an answer.
The code
Here are all the functions wrapped in a timing function:
function out = timing_hist(N,n)
M = randi([0 n],N,1);
func_times = {'for','unique & for','tabulate','histcounts','accumarray','bsxfun','ndgrid';
timeit(@() loop(M,n)),...
timeit(@() uloop(M,n)),...
timeit(@() tabi(M)),...
timeit(@() histci(M,n)),...
timeit(@() accumi(M)),...
timeit(@() bsxi(M,n)),...
timeit(@() gridi(M,n))};
out = cell2mat(func_times(2,:));
end
function spp = loop(M,n)
spp = zeros(n,1);
for k = 1:size(spp,1);
spp(k) = sum(M==k);
end
end
function spp = uloop(M,n)
u = unique(M);
spp = zeros(n,1);
for k = u(u>0).';
spp(k) = sum(M==k);
end
end
function tab = tabi(M)
tab = tabulate(M);
if tab(1)==0
tab(1,:) = [];
end
end
function spp = histci(M,n)
spp = histcounts(M,1:n+1);
end
function spp = accumi(M)
spp = accumarray(M(M>0),1);
end
function spp = bsxi(M,n)
spp = bsxfun(@eq,M,1:n);
spp = sum(spp,1);
end
function spp = gridi(M,n)
[Mx,nx] = ndgrid(M,1:n);
spp = sum(Mx==nx);
end
And here is the script to run this code and produce the graph:
N = 25; % it is not recommended to run this with N>19 for the bsxfun and ndgrid functions.
func_times = zeros(N,7); % 7 methods are timed
for n = 1:N
func_times(n,:) = timing_hist(2^n,500);
end
% plotting:
hold on
mark = 'xo*^dsp';
for k = 1:size(func_times,2)
plot(1:size(func_times,1),log10(func_times(:,k).*1000),['-' mark(k)],...
'MarkerEdgeColor','k','LineWidth',1.5);
end
hold off
xlabel('Log_2(Array size)','FontSize',16)
ylabel('Log_{10}(Execution time) (ms)','FontSize',16)
legend({'for','unique & for','tabulate','histcounts','accumarray','bsxfun','ndgrid'},...
'Location','NorthWest','FontSize',14)
grid on
¹ The reason for this weird name comes from my field, Ecology. My models are cellular automata that typically simulate individual organisms in a virtual space (the M above). The individuals are of different species (hence spp), and together they form what is called an "ecological community". The "state" of the community is given by the number of individuals from each species, which is the spp vector in this answer. In these models, we first define a species pool (X above) for the individuals to be drawn from, and the community state takes into account all species in the species pool, not only those present in M.
We know that the input vector always contains integers, so why not use this to "squeeze" a bit more performance out of the algorithm?
I've been experimenting with some optimizations of the two best binning methods suggested by the OP, and this is what I came up with:
The number of unique values (X in the question, or n in the example) should be explicitly converted to an (unsigned) integer type.
It's faster to compute an extra bin and then discard it, than to "only process" valid values (see the accumi_new function below).
This function takes about 30 sec to run on my machine. I'm using MATLAB R2016a.
function q38941694
datestr(now)
N = 25;
func_times = zeros(N,4);
for n = 1:N
func_times(n,:) = timing_hist(2^n,500);
end
% Plotting:
figure('Position',[572 362 758 608]);
hP = plot(1:n,log10(func_times.*1000),'-o','MarkerEdgeColor','k','LineWidth',2);
xlabel('Log_2(Array size)'); ylabel('Log_{10}(Execution time) (ms)')
legend({'histcounts (double)','histcounts (uint)','accumarray (old)',...
'accumarray (new)'},'FontSize',12,'Location','NorthWest')
grid on; grid minor;
set(hP([2,4]),'Marker','s'); set(gca,'Fontsize',16);
datestr(now)
end
function out = timing_hist(N,n)
% Convert n into an appropriate integer class:
if n < intmax('uint8')
classname = 'uint8';
n = uint8(n);
elseif n < intmax('uint16')
classname = 'uint16';
n = uint16(n);
elseif n < intmax('uint32')
classname = 'uint32';
n = uint32(n);
else % n < intmax('uint64')
classname = 'uint64';
n = uint64(n);
end
% Generate an input:
M = randi([0 n],N,1,classname);
% Time different options:
warning off 'MATLAB:timeit:HighOverhead'
func_times = {'histcounts (double)','histcounts (uint)','accumarray (old)',...
'accumarray (new)';
timeit(@() histci(double(M),double(n))),...
timeit(@() histci(M,n)),...
timeit(@() accumi(M)),...
timeit(@() accumi_new(M))
};
out = cell2mat(func_times(2,:));
end
function spp = histci(M,n)
spp = histcounts(M,1:n+1);
end
function spp = accumi(M)
spp = accumarray(M(M>0),1);
end
function spp = accumi_new(M)
spp = accumarray(M+1,1);
spp = spp(2:end);
end

Pivot to binary matrix from categorial array

I have an array with some values that belong to a set. I would like to transform this array into a binary matrix: each column of the matrix represents one possible value of the set, and each row has a 1 in the column that matches the input array's value and 0 in all the others. I think a name for that is something like a binary pivot.
The input array is a column from a table.
Example of input array (the previous example used only capital letters, which led to misinterpretation):
'Apple'
'Banana'
'Cherry'
'Dragonfruit'
'Apple'
'Cherry'
So, in this example input could assume 4 different values: 'Apple', 'Banana', 'Cherry' or 'Dragonfruit', in my real scenario it can be more than 4.
Example Output matrix:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 0 1 0
I have achieved this desired behavior, but I would like to know if there is a better way to perform this operation. In a vectorized way (without the for-loop for each category) or using a built-in function.
function [ binMatrix, categs ] = pivotToBinaryMatrix( input )
    categorizedInput = categorical(input);
    categs = categories(categorizedInput);
    binMatrix = zeros(size(input, 1), size(categs, 1));
    for i = 1:size(categs, 1)
        binMatrix(:,i) = ismember(categorizedInput, categs(i));
    end
end
For about 50,000 entries with 9 categories it performed in 0.075137 seconds.
EDIT: I've improved the examples, because the previous examples led to misinterpretation.
Here's my take on the problem:
input = ['ABCDAB']';
binMatrix = bsxfun(@eq,input,unique(input)');
For the benchmarking, I ran it on a Windows 7 machine, 4Gb RAM, Intel i7-2600 CPU 3.4 GHz, borrowing @rayryeng's initialization code:
% Generate dictionary from A up to I
ch = char(65 + (0:8));
rng(123);
% Generate 50000 random characters
v = randi(9, 50000, 1);
inputArray = ch(v);
time=0;
for ii=1:100
tic;
binMatrix = bsxfun(@eq,inputArray,unique(inputArray)');
t = toc;
time=time+t;
end
disp(time/100);
Which gave me 0.001203 seconds. For an extensive comparison of methods, please refer to @rayryeng's answer.
I'm going to assume that your input array is a cell array of characters like so:
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
You can convert the above into a numeric array by using the unique function's third output. What's great about this is that unique assigns a unique ID in sorted order, and so if you have a cell array of characters, it respects a lexicographical ordering of the characters.
Next, declare a matrix of zeros (like you did above) then use sub2ind to index into the matrix and set the values to 1.
Something like this. Bear in mind that I initialized the output slightly differently. It's a trick I learned to allocate a matrix of zeroes that is quite fast. See here: Faster way to initialize arrays via empty matrix multiplication? (Matlab)
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
[~,~,inputNum] = unique(inputArray);
inputNum = inputNum.'; %// To make compatible in dimensions
binMatrix(numel(inputArray), max(inputNum)) = 0;
binMatrix(sub2ind(size(binMatrix), 1:numel(inputArray), inputNum)) = 1;
Another method would be to create a sparse logical array where we set the right row and column positions to be 1, then use this to index into our zeroes array and set the values accordingly.
Something like:
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
[~,~,inputNum] = unique(inputArray);
inputNum = inputNum.'; %// To make compatible in dimensions
binMatrix = sparse(1:numel(inputArray), inputNum, 1, numel(inputArray), max(inputNum));
binMatrix = full(binMatrix);
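For instance, on the fruit example this reproduces the expected 6-by-4 matrix:
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
[~,~,inputNum] = unique(inputArray);
binMatrix = full(sparse(1:numel(inputArray), inputNum.', 1))   %// rows as in the question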
Let's put this all together in a timing script. I've incorporated the two methods above, plus your old method, plus Divakar's (only the first method) and brodroll's (very ingenious, btw) method. For Divakar's and brodroll's methods, I have also used unique with the third output, as your original inquiry had capital letters which confused us all. Using the third output easily converts their previous methods to your new specifications.
BTW, your example and your code are mismatched. Your example has it set so that each column is an index, but in your code it's each row. For the timing tests, I'm going to transpose your result. I'm running MATLAB R2013a on Mac OS X 10.10.3 with 16 GB of RAM and an Intel i7 2.3 GHz processor. So:
clear all;
close all;
%// Generate dictionary
chars = {'Apple', 'Banana', 'Cherry', 'Dragonfruit'};
rng(123);
%// Generate 50000 random words
v = randi(numel(chars), 50000, 1);
inputArray = chars(v);
[~,~,inputNum] = unique(inputArray);
inputNum = inputNum.'; %// To make compatible in dimensions
%// Timing #1 - sub2ind
tic;
binMatrix(numel(inputArray), max(inputNum)) = 0;
binMatrix(sub2ind(size(binMatrix), 1:numel(inputArray), inputNum)) = 1;
t = toc;
clear binMatrix;
%// Timing #2 - sparse
tic;
binMatrix = sparse(1:numel(inputArray), inputNum, 1, numel(inputArray), max(inputNum));
binMatrix = full(binMatrix);
t2 = toc;
clear binMatrix;
%// Timing #3 - ismember and for
tic;
binMatrix = zeros(numel(inputArray), numel(chars));
for i = 1: size(binMatrix,1)
binMatrix(i,:) = ismember(chars, inputArray(i));
end
t3 = toc;
%// Timing #4 - bsxfun
clear binMatrix;
tic;
binMatrix = bsxfun(@eq,inputNum',unique(inputNum)); %// Changed to make dimensions match
t4 = toc;
clear binMatrix;
%// Timing #5 - raw sub2ind
tic;
binMatrix(numel(inputArray), max(inputNum)) = 0;
binMatrix( (inputNum-1)*size(binMatrix,1) + [1:numel(inputArray)] ) = 1;
t5 = toc;
fprintf('Timing using sub2ind: %f seconds\n', t);
fprintf('Timing using sparse: %f seconds\n', t2);
fprintf('Timing using ismember and loop: %f seconds\n', t3);
fprintf('Timing using bsxfun: %f seconds\n', t4);
fprintf('Timing using raw sub2ind: %f seconds\n', t5);
We get:
Timing using sub2ind: 0.004223 seconds
Timing using sparse: 0.004252 seconds
Timing using ismember and loop: 2.771389 seconds
Timing using bsxfun: 0.020739 seconds
Timing using raw sub2ind: 0.000773 seconds
In terms of rank:
Raw sub2ind
sub2ind
sparse
bsxfun
OP's method
If you don't mind all zeros columns in cases where you have non-successive characters in the input array, something like 'ABEACF', where 'D' is missing, you can use this -
col_idx = inputArray - 'A' + 1;
binMatrix(numel(inputArray), max(col_idx) ) = 0;
binMatrix( (col_idx-1)*size(binMatrix,1) + [1:numel(inputArray)] ) = 1;
If you do care about that issue and would like no all-zeros columns, you can use a modified version of it -
[~,unq_pos,col_idx] = unique(inputArray,'stable');
binMatrix(numel(inputArray), numel(unq_pos)) = 0;
binMatrix( (col_idx-1)*size(binMatrix,1) + [1:numel(inputArray)].' ) = 1;
Basically both these approaches use the same hacky pre-allocation technique listed in Undocumented MATLAB and also used in the other answer by @rayryeng. On top of that, they use a raw version of sub2ind.

Fastest way to find unique values in an array

I'm trying to find the fastest way to find the unique values in an array, excluding 0 as a possible unique value.
Right now I have two solutions:
result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1
dataArray is equivalent to :
dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6]; % This is a small array, but in my case there are usually over 10 000 lines.
So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2].
The unique function is faster but I don't want 0 to be considered. Is there a way to do this with unique and not consider 0 as a unique value? Is there another alternative?
EDIT
I wanted to time the various solutions.
clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;
% Initial
tic
for ii = 1:10000
FCT1 = setxor(0, dataArray(:,1));
end
toc
% My solution
tic
for ii = 1:10000
FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc
% Pursuit solution
tic
for ii = 1:10000
FCT3 = unique(dataArray(:, 1));
FCT3(FCT3==0) = [];
end
toc
% Pursuit solution with chappjc comment
tic
for ii = 1:10000
FCT32 = unique(dataArray(:, 1));
FCT32 = FCT32(FCT32~=0);
end
toc
% chappjc solution
tic
for ii = 1:10000
FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc
% chappjc 2nd solution
tic
for ii = 1:10000
FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
FCT5 = FCT5(FCT5>0);
end
toc
And the results:
Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution
However, the solutions with sparse and accumarray only work with integer values; they won't work with non-integer doubles.
Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:
a = floor(10*rand(100000, 1)); a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;
Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!
Note that this solution requires integer-valued data (not necessarily an integer data type, but just not with decimal components). Comparing floating point values for equality, as unique would do, is perilous. See here and here.
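A toy example of the pitfall:
unique([0.3, 0.1 + 0.2])   % returns TWO elements that both display as 0.3000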
Original suggestion: Combine unique with setdiff:
result = setdiff(unique(a(:,1)),0)
Or remove with logical indexing after unique:
result = unique(a(:,1));
result = result(result>0);
I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.
Removing zeros after unique should be faster since it operates on less data (unless every element is unique, OR a/dataArray is very short).
Just to add to the general clamor - here are three different methods. They all give the same answer, but slightly different timings:
a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
tic
b1 = unique(a(:,1));
b1(b1==0) = [];
toc
tic
b2 = find(sparse(a(:,1)+1, 1, 1)) - 1;
b2(b2==0)=[];
toc
tic
b3 = setxor(0, a(:, 1), 'rows');
toc
display(b1)
display(b2)
display(b3)
On my machine, the timings (for an array of 100000 elements) were as follows:
0.0087 s - for unique
0.0142 s - for find(sparse)
0.0302 s - for setxor
I always like sparse for a problem like this - you get the count of elements at the same time as their unique values.
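To illustrate getting both at once, a sketch along the same lines:
S = sparse(a(:,1)+1, 1, 1);   % bin k+1 accumulates the count of value k
vals = find(S) - 1;           % the distinct values (0 included if present)
counts = nonzeros(S);         % the corresponding counts, in the same order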
EDIT: per @chappjc's suggestion, I added a fourth option:
b4 = find(accumarray(a(:,1)+1,1))-1;
b4(b4==0) = [];
Time:
0.0029 s - THREE TIMES FASTER THAN unique
Ladies and gentlemen, we have a winner.
AFTERWORD: the index-based methods (sparse and accumarray) only work with integer-valued inputs (although they can be of double type). This seemed OK based on the input array given in the question, but it doesn't work for non-integer-valued inputs. Of course, unique is a tricky concept when you have doubles - numbers that "look" the same may be represented differently. You might consider truncating the input array (sanitizing the data) to make sure this is not a problem. For example, if you did
a = 0.001 * double(int32(a * 1000));
You would round all values to no more than 3 decimal places, and because you went "via an int" you are sure that you don't end up with values that are "very subtly different" (say in the 8th digit or beyond). Of course in that case you could also do
a = round(a * 1000);
mina = min(a(:));
b = find(accumarray(a - mina + 1, 1)) + mina - 1;
b = 0.001 * b(b ~= 0);
This is "fairly robust" for non-integer values (in the above case it handles values with up to three significant digits; if you need more, the space requirements will eventually get too large and this method will be slower than unique, which in fact has to sort the data.)
Why not remove the zeros as a second step:
result2 = unique(.....);
result2 = result2(result2~=0);
I also found another way to do it:
result2 = unique(dataArray(dataArray(:,1)>0,1));
