Fastest way to find unique values in an array - arrays

I'm trying to find the fastest way to find the unique values in an array while excluding 0 as a possible unique value.
Right now I have two solutions:
result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1
dataArray is equivalent to :
dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6]; % This is a small array, but in my case there are usually over 10 000 lines.
So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2].
The unique function is faster, but I don't want 0 to be considered. Is there a way to do this with unique without treating 0 as a unique value? Is there another alternative?
EDIT
I wanted to time the various solutions.
clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;
% Initial
tic
for ii = 1:10000
FCT1 = setxor(0, dataArray(:,1));
end
toc
% My solution
tic
for ii = 1:10000
FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc
% Pursuit solution
tic
for ii = 1:10000
FCT3 = unique(dataArray(:, 1));
FCT3(FCT3==0) = [];
end
toc
% Pursuit solution with chappjc comment
tic
for ii = 1:10000
FCT32 = unique(dataArray(:, 1));
FCT32 = FCT32(FCT32~=0);
end
toc
% chappjc solution
tic
for ii = 1:10000
FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc
% chappjc 2nd solution
tic
for ii = 1:10000
FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
FCT5 = FCT5(FCT5>0);
end
toc
And the results:
Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution
However, the solutions with sparse and accumarray only work with integer-valued data. They won't work with non-integer doubles.

Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:
a = floor(10*rand(100000, 1)); a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;
Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!
Note that this solution requires integer-valued data (not necessarily an integer data type, but just not with decimal components). Comparing floating point values for equality, as unique would do, is perilous. See here and here.
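As a quick sanity check, the sketch below runs the accumarray approach next to unique on a tiny hand-made column. The vector a here is made up for illustration and is assumed to be nonnegative and integer-valued, which is what the indexing trick requires.

```matlab
% Tiny made-up test column: nonnegative and integer-valued (assumption)
a = [0; 0; 1; 1; 2; 2; 0; 5];

% Sort-based: unique, then drop zero with logical indexing
r1 = unique(a);
r1 = r1(r1 ~= 0);

% Counting-based: nonzeros removes the zeros up front, accumarray counts
% occurrences at index value+1, and find recovers the values that occurred
r2 = find(accumarray(nonzeros(a) + 1, 1)) - 1;

disp(isequal(r1, r2))   % both are [1; 2; 5]
```

Negative values would produce invalid indices here, so shift the data by its minimum first if that can happen.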
Original suggestion: Combine unique with setdiff:
result = setdiff(unique(a(:,1)),0)
Or remove with logical indexing after unique:
result = unique(a(:,1));
result = result(result>0);
I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.
Removing zeros after unique should be faster since it operates on less data (unless every element is unique, or a/dataArray is very short).

Just to add to the general clamor - here are three different methods. They all give the same answer, but slightly different timings:
a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
tic
b1 = unique(a(:,1));
b1(b1==0) = [];
toc
tic
b2 = find(sparse(a(:,1)+1, 1, 1)) - 1;
b2(b2==0)=[];
toc
tic
b3 = setxor(0, a(:, 1), 'rows');
toc
display(b1)
display(b2)
display(b3)
On my machine, the timings (for an array of 100000 elements) were as follows:
0.0087 s - for unique
0.0142 s - for find(sparse)
0.0302 s - for setxor
I always like sparse for a problem like this - you get the count of elements at the same time as their unique values.
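To make the "count at the same time" remark concrete, here is a minimal sketch on a made-up vector (assumed nonnegative and integer-valued). It uses accumarray for the counting; sparse(a+1, 1, 1) accumulates the same counts.

```matlab
a = [0; 3; 3; 1; 0; 3];             % made-up data for illustration
counts = accumarray(a + 1, 1);      % counts(k+1) = number of times k occurs
vals = find(counts) - 1;            % values that occur at least once
cnt = counts(vals + 1);             % their occurrence counts
keep = vals ~= 0;                   % drop zero, as in the question
summary = [vals(keep), cnt(keep)];  % one row per value: [value, count]
disp(summary)
```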
EDIT per @chappjc's suggestion, I added a fourth option:
b4 = find(accumarray(a(:,1)+1,1))-1;
b4(b4==0) = [];
Time:
0.0029 s , THREE TIMES FASTER THAN UNIQUE
Ladies and gentlemen, we have a winner.
AFTERWORD: the index-based methods (sparse and accumarray) only work with integer-valued inputs (although they can be of double type). This seemed OK based on the input array given in the question, but doesn't work for non-integer-valued inputs. Of course, unique is a tricky concept when you have doubles: numbers that "look" the same may be represented differently. You might consider truncating the input array (sanitizing the data) to make sure this is not a problem. For example, if you did
a = 0.001 * double(int32(a * 1000));
you would round all values to no more than 3 decimal places, and because you went "via an int" you are sure that you don't end up with values that are "very subtly different" (say in the 8th digit or beyond). Of course in that case you could also do
a = round(a * 1000);
mina = min(a(:));
b = find(accumarray(a - mina + 1, 1)) + mina - 1;
b = 0.001 * b(b ~= 0);
This is "fairly robust" for non-integer values (in the above case it handles values with up to three decimal places; if you need more, the space requirements will eventually get too large and this method will be slower than unique, which in fact has to sort the data).
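The point about doubles that "look" the same can be seen in a two-element sketch. The values are chosen for illustration only; 0.1 + 0.2 and 0.3 are a well-known pair that differ in their last bits.

```matlab
x = [0.1 + 0.2; 0.3];        % display the same at short precision
nRaw = numel(unique(x));     % unique compares exact bit patterns
xq = round(x * 1000);        % quantize to 3 decimal places, stored as whole numbers
nQuant = numel(unique(xq));  % after quantizing, the two values coincide
fprintf('%d distinct before, %d after quantizing\n', nRaw, nQuant);
```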

Why not remove the zeros as a second step:
result2 = unique(.....);
result2 = result2(result2~=0);

I also found another way to do it :
result2 = unique(dataArray(dataArray(:,1)>0,1));

Related

How to increment some of elements in an array by specific values in MATLAB

Suppose we have an array
A = zeros([1,10]);
We have several indices with possible duplicates, say:
indSeq = [1,1,2,3,4,4,4];
How can we increase A(i) by the number of times i appears in the index sequence, i.e. A(1) = 2, A(2) = 1, A(3) = 1, A(4) = 3?
The code A(indSeq) = A(indSeq)+1 does not work, because each repeated index is incremented only once.
I know that I can use the following for loop to achieve the goal, but I wonder if there is any way to avoid the for-loop. We can assume that indSeq is sorted.
A for-loop solution:
for i=1:length(indSeq)
A(indSeq(i)) = A(indSeq(i))+1;
end;
You can use accumarray for such a label based counting job, like so -
accumarray(indSeq(:),1)
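If A must keep its original length (10 in the question), accumarray accepts an output size as its third argument. A minimal sketch using the question's data:

```matlab
A = zeros(1, 10);
indSeq = [1, 1, 2, 3, 4, 4, 4];

% Count how often each index occurs; [numel(A) 1] pads the result with
% zeros up to the length of A. Transpose because A is a row vector.
A = A + accumarray(indSeq(:), 1, [numel(A) 1]).';

disp(A)   % 2 1 1 3 0 0 0 0 0 0
```

Adding to A rather than overwriting it keeps any values A already held.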
Benchmarking
As suggested in the other answer, you can also use hist/histc. Let's benchmark these two for a large datasize. The benchmarking code I used had -
%// Create huge random array filled with ints that are duplicated & sorted
maxn = 100000;
N = 10000000;
indSeq = sort(randi(maxn,1,N));
disp('--------------------- With HISTC')
tic,histc(indSeq,unique(indSeq));toc
disp('--------------------- With ACCUMARRAY')
tic,accumarray(indSeq(:),1);toc
Runtime output -
--------------------- With HISTC
Elapsed time is 1.028165 seconds.
--------------------- With ACCUMARRAY
Elapsed time is 0.220202 seconds.
This is run-length encoding, and the following code should do the trick for you.
A=zeros(1,10);
indSeq = [1,1,2,3,4,4,4,7,1];
indSeq=sort(indSeq); %// if your input is always sorted, you don't need to do this
pos = [1; find(diff(indSeq(:)))+1; numel(indSeq)+1];
A(indSeq(pos(1:end-1)))=diff(pos)
which returns
A =
3 1 1 3 0 0 1 0 0 0
This algorithm was written by Luis Mendo for MATL.
I think what you are looking for is the number of occurrences of unique values of the array. This can be accomplished with:
[num, val] = hist(indSeq,unique(indSeq));
the output of your example is:
num = 2 1 1 3
val = 1 2 3 4
so num is the number of times val occurs. i.e. the number 1 occurs 2 times in your example

Pivot to binary matrix from categorical array

I have an array with some values that belong to a set. I would like to transform this array into a binary matrix: each column of this matrix represents one possible value of the set, and each row has a 1 in the column that matches the input value and 0 in all the others. I think a name for that is something like a binary pivot.
The input array is a column of a table type
Example of input array (the previous example used only capital letters, which led to misinterpretation):
'Apple'
'Banana'
'Cherry'
'Dragonfruit'
'Apple'
'Cherry'
So, in this example input could assume 4 different values: 'Apple', 'Banana', 'Cherry' or 'Dragonfruit', in my real scenario it can be more than 4.
Example Output matrix:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 0 1 0
I have achieved this desired behavior, but I would like to know if there is a better way to perform this operation. In a vectorized way (without the for-loop for each category) or using a built-in function.
function [ binMatrix, categs ] = pivotToBinaryMatrix( input )
categorizedInput = categorical(input);
categs = categories(categorizedInput);
binMatrix = zeros(size(input, 1), size(categs, 1));
for i = 1:size(categs, 1)
binMatrix(:,i) = ismember(categorizedInput, categs(i));
end
end
For about 50,000 entries with 9 categories it ran in 0.075137 seconds.
EDIT: I've improved the examples, because the previous examples led to misinterpretation.
Here's my take on the problem:
input = ['ABCDAB']';
binMatrix = bsxfun(@eq,input,unique(input)');
For the benchmarking, I ran it on a Windows 7 machine with 4 GB RAM and an Intel i7-2600 CPU at 3.4 GHz, borrowing @rayryeng's initialization code:
% Generate dictionary from A up to I
ch = char(65 + (0:8));
rng(123);
% Generate 50000 random characters
v = randi(9, 50000, 1);
inputArray = ch(v);
time=0;
for ii=1:100
tic;
binMatrix = bsxfun(@eq,inputArray,unique(inputArray)');
t = toc;
time=time+t;
end
disp(time/100);
Which gave me 0.001203 seconds. For an extensive comparison of methods, please refer to @rayryeng's answer.
I'm going to assume that your input array is a cell array of characters like so:
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
You can convert the above into a numeric array by using the unique function's third output. What's great about this is that unique assigns a unique ID in sorted order, and so if you have a cell array of characters, it respects a lexicographical ordering of the characters.
Next, declare a matrix of zeros (like you did above) then use sub2ind to index into the matrix and set the values to 1.
Something like this. Bear in mind that I initialized the output slightly differently. It's a trick I learned to allocate a matrix of zeroes that is quite fast. See here: Faster way to initialize arrays via empty matrix multiplication? (Matlab)
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
[~,~,inputNum] = unique(inputArray);
inputNum = inputNum.'; %// To make compatible in dimensions
binMatrix(numel(inputArray), max(inputNum)) = 0;
binMatrix(sub2ind(size(binMatrix), 1:numel(inputArray), inputNum)) = 1;
Another method would be to create a sparse logical array where we set the right row and column positions to be 1, then use this to index into our zeroes array and set the values accordingly.
Something like:
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
[~,~,inputNum] = unique(inputArray);
inputNum = inputNum.'; %// To make compatible in dimensions
binMatrix = sparse(1:numel(inputArray), inputNum, 1, numel(inputArray), max(inputNum));
binMatrix = full(binMatrix);
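Since the third output of unique is already a vector of column indices, another compact option is to pick rows out of an identity matrix. This variant is not among the methods timed in this thread, so no claim is made here about its speed; it is shown only because it makes the role of the index vector easy to see.

```matlab
inputArray = {'Apple', 'Banana', 'Cherry', 'Dragonfruit', 'Apple', 'Cherry'};
[categs, ~, inputNum] = unique(inputArray);  % inputNum(k) = category index of entry k

I = eye(numel(categs));      % row j is the one-hot code of category j
binMatrix = I(inputNum, :);  % one row per input entry

disp(binMatrix)
```

The result matches the example output matrix given in the question.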
Let's put this all together in a timing script. I've incorporated the two methods above, plus your old method, plus Divakar's (only the first method) and brodroll's (very ingenious btw) method. For Divakar's and brodroll's methods, I have also used unique with the third output, as your original inquiry had capital letters, which confused us all. Using the third output can easily convert their previous methods to your new specifications.
BTW, your example and your code are mismatched. Your example has it set so that each column is an index, but in your code it's each row. For the timing tests, I'm going to transpose your result. I'm running MATLAB R2013a on Mac OS X 10.10.3 with 16 GB of RAM and an Intel i7 2.3 GHz processor. So:
clear all;
close all;
%// Generate dictionary
chars = {'Apple', 'Banana', 'Cherry', 'Dragonfruit'};
rng(123);
%// Generate 50000 random words
v = randi(numel(chars), 50000, 1);
inputArray = chars(v);
[~,~,inputNum] = unique(inputArray);
inputNum = inputNum.'; %// To make compatible in dimensions
%// Timing #1 - sub2ind
tic;
binMatrix(numel(inputArray), max(inputNum)) = 0;
binMatrix(sub2ind(size(binMatrix), 1:numel(inputArray), inputNum)) = 1;
t = toc;
clear binMatrix;
%// Timing #2 - sparse
tic;
binMatrix = sparse(1:numel(inputArray), inputNum, 1, numel(inputArray), max(inputNum));
binMatrix = full(binMatrix);
t2 = toc;
clear binMatrix;
%// Timing #3 - ismember and for
tic;
binMatrix = zeros(numel(inputArray), numel(chars));
for i = 1: size(binMatrix,1)
binMatrix(i,:) = ismember(chars, inputArray(i));
end
t3 = toc;
%// Timing #4 - bsxfun
clear binMatrix;
tic;
binMatrix = bsxfun(@eq,inputNum',unique(inputNum)); %// Changed to make dimensions match
t4 = toc;
clear binMatrix;
%// Timing #5 - raw sub2ind
tic;
binMatrix(numel(inputArray), max(inputNum)) = 0;
binMatrix( (inputNum-1)*size(binMatrix,1) + [1:numel(inputArray)] ) = 1;
t5 = toc;
fprintf('Timing using sub2ind: %f seconds\n', t);
fprintf('Timing using sparse: %f seconds\n', t2);
fprintf('Timing using ismember and loop: %f seconds\n', t3);
fprintf('Timing using bsxfun: %f seconds\n', t4);
fprintf('Timing using raw sub2ind: %f seconds\n', t5);
We get:
Timing using sub2ind: 0.004223 seconds
Timing using sparse: 0.004252 seconds
Timing using ismember and loop: 2.771389 seconds
Timing using bsxfun: 0.020739 seconds
Timing using raw sub2ind: 0.000773 seconds
In terms of rank:
Raw sub2ind
sub2ind
sparse
bsxfun
OP's method
If you don't mind all zeros columns in cases where you have non-successive characters in the input array, something like 'ABEACF', where 'D' is missing, you can use this -
col_idx = inputArray - 'A' + 1;
binMatrix(numel(inputArray), max(col_idx) ) = 0;
binMatrix( (col_idx-1)*size(binMatrix,1) + [1:numel(inputArray)] ) = 1;
If you do care about that issue and would like no all-zeros columns, you can use a modified version of it -
[~,unq_pos,col_idx] = unique(inputArray,'stable');
binMatrix(numel(inputArray), numel(unq_pos)) = 0;
binMatrix( (col_idx-1)*size(binMatrix,1) + [1:numel(inputArray)].' ) = 1;
Basically both these approaches use the same hacky technique to pre-allocate as listed in Undocumented MATLAB and also listed in the other answer by @rayryeng. On top of it, it uses a raw version of sub2ind.
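The raw version relies on MATLAB's column-major storage: element (r, c) of a matrix with nRows rows has linear index (c-1)*nRows + r. A small check against sub2ind, with made-up row and column indices:

```matlab
nRows = 6;
nCols = 4;
rows = (1:nRows).';           % one entry per input element
cols = [1; 2; 3; 4; 1; 3];    % made-up category index of each element

linRaw = (cols - 1) * nRows + rows;           % hand-computed linear index
linRef = sub2ind([nRows nCols], rows, cols);  % reference result

disp(isequal(linRaw, linRef))
```

Skipping sub2ind avoids its input validation, which is presumably where the time saving in the benchmark comes from.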

Matlab: Sum corresponding values if index is within a range

I have been going crazy trying to figure out a way to speed this up. Right now my current code takes ~200 sec looping over 77000 events. I was hoping someone might be able to help me speed this up because I have to do about 500 of these.
Problem:
I have two arrays (both 200000x1) that hold the energy and position of each hit over 77000 events. The range of each event is stored in two arrays, event_start and event_end. First I look for positions in a specific range, then I put the corresponding energies in their own array. To get what I need out of this information, I loop through each event and its corresponding start/end to sum up the energies from each hit. My code is below:
indx_pos = find(pos>0.7 & pos<2.0);
energy = HitEnergy(indx_pos);
for i=1:n_events
Etotal(i) = sum(energy(find(indx_pos>=event_start(i) ...
& indx_pos<=event_end(i))));
end
Sample input & output:
% Sample input
% pos and energy same length
n_events = 3;
event_start = [1 3 7]';
event_end = [2 6 8]';
pos = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';
% Sample Output
Etotal = 0.0060
0.0800
0
Approach #1: Generic case
One approach with bsxfun and matrix-multiplication -
mask = bsxfun(@ge,indx_pos,event_start.') & bsxfun(@le,indx_pos,event_end.')
Etotal = energy.'*mask
This could be a bit memory-hungry if indx_pos has lots of elements in it.
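As a worked check, here is Approach #1 applied to the sample input from the question. The mask is cast to double before the multiplication as a precaution; note that the result comes out as a row vector rather than the column shown in the question.

```matlab
% Sample input from the question
event_start = [1 3 7]';
event_end   = [2 6 8]';
pos       = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';

indx_pos = find(pos > 0.7 & pos < 2.0);   % hits inside the position window
energy = HitEnergy(indx_pos);

% mask(h, e) is true when selected hit h lies inside event e
mask = bsxfun(@ge, indx_pos, event_start.') & bsxfun(@le, indx_pos, event_end.');

% One dot product per event sums the energies of its selected hits
Etotal = energy.' * double(mask);
disp(Etotal)   % 0.0060 0.0800 0
```

The mask has one row per selected hit and one column per event, which is where the memory cost comes from.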
Approach #2: Non-overlapping start/end ranges case
One can use accumarray for this special case like so -
%// Setup ID array for use in accumarray later on
loc(numel(pos))=0; %// Fast pre-allocation scheme
valids = event_end+1<=numel(pos);
loc(event_end(valids)+1) = -1*(1:sum(valids));
loc(event_start) = loc(event_start)+(1:numel(event_end));
id = cumsum(loc);
%// Set elements as zeros in HitEnergy that do not satisfy the criteria:
%// pos>0.7 & pos<2.0
HitEnergy_select = (pos>0.7 & pos<2.0).*HitEnergy(:);
%// Discard elements in HitEnergy_select & id that have IDs as zeros
HitEnergy_select = HitEnergy_select(id~=0);
id = id(id~=0);
%// Accumulate summations as done inside the loop in the original code
Etotal = accumarray(id(:),HitEnergy_select);
The problem is that for every event you are searching the entire vector indx_pos.
Constrain your search inside the loop to only the range from event_start(i) to event_end(i):
for i = 1:n_events
I = event_start(i):event_end(i);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal(i) = sum(HitEnergy(I(posIIsWithinRange)));
end
You could also use a vectorized version based on run length decoding and vectorizing the notion of colon. (Download the functions coloncatrld and runLengthDecode.)
I = coloncatrld(event_start, event_end);
energy = HitEnergy(I);
eventNum = runLengthDecode(event_end - event_start+1);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal = accumarray(eventNum(posIIsWithinRange), energy(posIIsWithinRange), [n_events,1]);
This is similar to Divakar's Approach #2 with the addition that it should work for overlapping ranges too.
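If you would rather not download extra functions, the same idea can be sketched with cumsum and accumarray alone. This is only a stand-in written for this example, not the actual implementation of coloncatrld or runLengthDecode. It assumes column vectors and that every event spans at least one sample (event_end >= event_start); the names lens and runStart are introduced here.

```matlab
% Sample input from the question
n_events = 3;
event_start = [1 3 7]';
event_end   = [2 6 8]';
pos       = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';

lens = event_end - event_start + 1;       % samples per event (assumed >= 1)
total = sum(lens);
runStart = cumsum([1; lens(1:end-1)]);    % where each event starts in the expanded list

% Run-length decode: event number for every expanded sample
eventNum = cumsum(accumarray(runStart, 1, [total 1]));

% Concatenation of event_start(i):event_end(i) over all events
I = event_start(eventNum) + (1:total).' - runStart(eventNum);

inRange = pos(I) > 0.7 & pos(I) < 2.0;
Etotal = accumarray(eventNum(inRange), HitEnergy(I(inRange)), [n_events 1]);
disp(Etotal)   % 0.0060; 0.0800; 0
```

Because every event gets its own stretch of the expanded index list, overlapping ranges are handled as well.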

Matlab random sample of a dataset

I have a dataset (Data) which is a vector of, let's say, 1000 real numbers. I would like to extract at random from Data 100 times 10 contiguous numbers. I don't know how to use Datasample for that purpose.
Thanks in advance for your help.
You can just pick 100 random numbers between 1 and 991:
I = randi(991, 100, 1)
Then use them as the starting points to index 10 contiguous elements:
cell2mat(arrayfun(@(x)(Data(x:x+9)), I, 'uni', false))
Here you have a snippet; instead of using Datasample, I used randi to generate random indices.
n_times = 100;
l_data = length(Data);
index_random = randi(l_data-9,n_times,1); % '-9' so that reading 10 items never runs past the end of the vector
for ind1 = 1:n_times
random_number(ind1,:) = Data(index_random(ind1):index_random(ind1)+9);
end
This is similar to Dan's answer, but avoids using cells and arrayfun, so it may be faster.
Let Ns denote the number of contiguous numbers you want (10 in your example), and Nt the number of times (100 in your example). Then:
result = Data(bsxfun(@plus, randi(numel(Data)-Ns+1, Nt, 1), 0:Ns-1)); %// Nt x Ns
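To see what the one-liner builds, the sketch below splits it into steps. Data is a made-up stand-in in which each value equals its own position, so contiguity of the samples can be read off directly; the names starts and idx are introduced here.

```matlab
Data = (1:1000).';   % made-up data: value equals position (assumption)
Ns = 10;             % contiguous numbers per sample
Nt = 100;            % number of samples

starts = randi(numel(Data) - Ns + 1, Nt, 1);  % random start of each block
idx = bsxfun(@plus, starts, 0:Ns-1);          % row k is starts(k):starts(k)+Ns-1
result = Data(idx);                           % Nt-by-Ns, same shape as idx

disp(all(all(diff(result, 1, 2) == 1)))       % every row is contiguous
```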
Here is another solution, close to @Luis's, but with cumsum instead of bsxfun:
A = rand(1,1000); % The vector to sample
sz = size(A,2);
N = 100; % no. of samples
B = 10; % size of one sample
first = randi(sz-B+1,N,1); % the starting point for all blocks
rand_blocks = A(cumsum([first ones(N,B-1)],2)); % the result
This results in an N-by-B matrix (rand_blocks), each row of it is one sample. Of course, this could be one-lined, but it won't make it faster, and I want to keep it clear. For small N or B this method is slightly faster. If N or B becomes very large then the bsxfun method is slightly faster. This ranking is not affected by the size of A.

Pre-allocate an array of structures for use in genetic algorithm in Matlab

This is the code I have so far:
population = 50
individual = repmat(struct('genes',[], 'fitness', 0), population, 1);
So what I'm doing is creating a population of 50 individuals, each of which has the components genes and fitness. What I can't seem to do correctly is set genes up to be a 50-cell array rather than just a single cell.
Can anyone shed some light on this for me please?
A further addition I'd like to make is to populate the genes array with random values (either 0 or 1). I imagine I could easily do this afterwards by iterating through the genes array of each member and using whatever random number generating functionality Matlab has available. However, it would be more efficient to do this when the structures are being pre-allocated.
Thanks
Why not use a class instead of a struct? Creating a simple class person:
classdef person
properties
fitness = 0;
end
properties(SetAccess = private)
genes
end
methods
function obj = person()
obj.genes = randi([0 1], 10, 1);
end
end
end
and then running the following script:
population = 50;
people = person.empty(population, 0);
people(1).fitness = 100;
people(2).fitness = 50;
people(1)
people(2)
produces the following console output:
ans =
person with properties:
fitness: 100
genes: [10x1 double]
ans =
person with properties:
fitness: 50
genes: [10x1 double]
If you are looking to allocate different random values to each individual, then doing a repmat as an allocation isn't going to help, as this just replicates the same thing 50 times. You are better off just using a simple loop:
population=50;
individual=struct('genes',[],'fitness',0);
for m=1:population
individual(m).genes=rand(1,50)>=0.5;
end
This is no less efficient than allocating all of them and then looping through - in each case the genes array is only allocated once. Moreover, allocating and reallocating 50 cells isn't going to be very slow - you will probably not notice much difference until you hit thousands or tens of thousands.
Well, keeping to structs, here's a few ways:
% Your original method
clear all
tic
population = 50;
individual = repmat(struct('genes', false(50,1), 'fitness', 0), population, 1);
toc
% simple loop
clear all
tic
population = 50;
individual(population,1) = struct('genes', false(50,1), 'fitness', 0);
for ii = 1:population
individual(ii).genes = false(50,1);
end
toc
% Third option
clear all
tic
population = 50;
individual = struct(...
'genes' , num2cell(false(50,population),2), ...
'fitness', num2cell(zeros(population,1)));
toc
Results:
Elapsed time is 0.009887 seconds. % your method
Elapsed time is 0.000475 seconds. % loop
Elapsed time is 0.013252 seconds. % init with cells
My suggestion: just use the loop :)
You can do something similar to this :
individual = repmat(struct('genes',{cell(1,50)}, 'fitness', 0), population, 1);
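If you want to avoid the loop and still give every individual its own random genes, the cell-array form of struct can do both in one call. A sketch, assuming 50 binary genes per individual (nGenes is a name introduced here, not part of the question):

```matlab
population = 50;
nGenes = 50;   % assumed number of genes per individual

% One cell per individual, each holding a 1-by-nGenes logical row of random bits
genes = num2cell(rand(population, nGenes) >= 0.5, 2);

% A non-scalar cell array yields one struct element per cell;
% the scalar 0 is copied into the fitness field of every element
individual = struct('genes', genes, 'fitness', 0);

disp(size(individual))   % 50 1
```

Whether this beats the simple loop for a population of 50 is not something the timings above settle, so measure before preferring it.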

Resources