Cluster analysis on a 1D vector - arrays

Consider the following data:
A = [-1 -1 -1 0 1 -1 -1 0 0 1 1 1 1 -1 1 0 1];
How can the size and appearance frequency of clusters in A (of similar neighbors) be calculated, preferably using MATLAB built in commands?
The result should read something like
s_plus = [1 2 3 4 5 ; 3 0 0 1 0]'; % accounts (1,1,1,1) and (1),(1),(1) which appear in A
s_zero = [1 2 3 4 5 ; 2 1 0 0 0]'; % accounts (0,0) and (0),(0) which appear in A
s_mins = [1 2 3 4 5 ; 1 1 1 0 0]'; % accounts (-1), (-1,-1) , and (-1,-1,-1)) which appear in A
in the above the first column indicates the cluster size and the second column is the appearance frequency.

You can use run length encoding to transform your input array into two arrays
The value of a group (or "run" of equal values)
The number of elements in that group
Then you can covert this into your desired output by checking when two conditions are true
The values array matches the value you want (-1,0,1)
The group size matches 1..5
This might sound a bit tricky but it's only a few lines of code, and should be relatively fast for even large arrays because the outputs are calculated from the "encoded" arrays which will be smaller than the input array.
Here is the code, see the comments for details:
A = [-1 -1 -1 0 1 -1 -1 0 0 1 1 1 1 -1 1 0 1]; % Example input
% Run length encoding step
idx = [ find( A(1:end-1) ~= A(2:end) ), numel(A) ]; % Find group start points
count = diff([0, idx]); % Find number of elements in each group
val = A( idx ); % Get value of each group
% Helper function to go from "val" and "count" to desired output format
% by checking value = target and group size matches 1 to 5, counting matching groups.
f = #(v) sum(val==v & count==(1:5).',2).';
% Create outputs
s_plus = f(1); % = [3 0 0 1 0]
s_zero = f(0); % = [2 1 0 0 0]
s_mins = f(-1); % = [1 1 1 0 0]

Related

Find the middle value in array that meets condition

I've got logical array(zeros and ones) 1500x700
I want to find "1" in every column and when there are more than one "1" in column i should choose the middle one.
Is that possible to do it? I know how to find "1", but don't know how to extract the middle "1" if there's couple of "1" in one column.
The find function returns the indices of your ones.
>> example=[1,0,0,1,0,1,1];
>> indices=find(example)
indices =
1 4 6 7
>> indices(floor(numel(indices)/2))
ans =
4
Do this for each column and you have a solution.
You can
Get the row and column indices of ones with find;
Apply accumarray with a custom function to get the middle row index for each column.
x = [1 0 0 0 0; 0 0 1 0 0; 1 0 1 0 0; 1 0 0 1 0]; % example
[ii, jj] = find(x); % step 1
result = accumarray(jj, ii, [size(x,2) 1], #(x) x(ceil(end/2)), NaN); % step 2
Note that:
For an even number of ones this gives the first of the two middle indices. If you prefer the average of the two middle indices replace #(x) x(ceil(end/2)) by #median.
For a column without ones this gives NaN as result. If you prefer a different value, replace the input fifth argument of accumarray by that.
Example:
x =
1 0 0 0 0
0 0 1 0 0
1 0 1 0 0
1 0 0 1 0
result =
3
NaN
2
4
NaN

Number 0's and 1's blocks in a binary vector

In MATLAB, there is the bwlabel function, that given a binary vector, for instance x=[1 1 0 0 0 1 1 0 0 1 1 1 0] gives (bwlabel(x)):
[1 1 0 0 0 2 2 0 0 3 3 3 0]
but what I want to obtain is
[1 1 2 2 2 3 3 4 4 5 5 5 6]
I know I can negate x to obtain (bwlabel(~x))
[0 0 1 1 1 0 0 2 2 0 0 0 3]
But how can I combine them?
All in one line:
y = cumsum([1,abs(diff(x))])
Namely, abs(diff(x)) spots changes in the binary vector, and you gain the output with the cumulative sum.
You can still do it using bwlabel by vertically concatenating x and ~x, using 4-connected components for the labeling, then taking the maximum down each column:
>> max(bwlabel([x; ~x], 4))
ans =
1 1 2 2 2 3 3 4 4 5 5 5 6
However, the solution from Bentoy13 is probably a bit faster.
x=[1 1 0 0 0 1 1 0 0 1 1 1 0];
A = bwlabel(x);
B = bwlabel(~x);
if x(1)==1
tmp = A>0;
A(tmp) = 2*A(tmp)-1;
tmp = B>0;
B(tmp) = 2*B(tmp);
C = A+B
elseif x(1)==0
tmp = A>0;
A(tmp) = 2*A(tmp);
tmp = B>1;
B(tmp) = 2*B(tmp)-1;
C = A+B
end
C =
1 1 2 2 2 3 3 4 4 5 5 5 6
You know the first index should remain 1, but the second index should go from 1 to 2, the third from 2 to 3 etc; thus even indices should be doubled and odd indices should double minus one. This is given by A+A-1 for odd entries, and B+B for even entries. So a simple check for whether A or B contains the even points is sufficient, and then simply add the two arrays.
I found this function that does exactly what i wanted:
https://github.com/davidstutz/matlab-multi-label-connected-components
So, clone the repository and compile in matlab using mex :
mex sp_fast_connected_relabel.cpp
Then,
labels = sp_fast_connected_relabel(x);

Find where condition is true n times consecutively

I have an array (say of 1s and 0s) and I want to find the index, i, for the first location where 1 appears n times in a row.
For example,
x = [0 0 1 0 1 1 1 0 0 0] ;
i = 5, for n = 3, as this is the first time '1' appears three times in a row.
Note: I want to find where 1 appears n times in a row so
i = find(x,n,'first');
is incorrect as this would give me the index of the first n 1s.
It is essentially a string search? eg findstr but with a vector.
You can do it with convolution as follows:
x = [0 0 1 0 1 1 1 0 0 0];
N = 3;
result = find(conv(x, ones(1,N), 'valid')==N, 1)
How it works
Convolve x with a vector of N ones and find the first time the result equals N. Convolution is computed with the 'valid' flag to avoid edge effects and thus obtain the correct value for the index.
Another answer that I have is to generate a buffer matrix where each row of this matrix is a neighbourhood of overlapping n elements of the array. Once you create this, index into your array and find the first row that has all 1s:
x = [0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
%// Solution
ind = bsxfun(#plus, (1:numel(x)-n+1).', 0:n-1); %'
out = find(all(x(ind),2), 1);
The first line is a bit tricky. We use bsxfun to generate a matrix of size m x n where m is the total number of overlapping neighbourhoods while n is the size of the window you are searching for. This generates a matrix where the first row is enumerated from 1 to n, the second row is enumerated from 2 to n+1, up until the very end which is from numel(x)-n+1 to numel(x). Given n = 3, we have:
>> ind
ind =
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
8 9 10
These are indices which we will use to index into our array x, and for your example it generates the following buffer matrix when we directly index into x:
>> x = [0 0 1 0 1 1 1 0 0 0];
>> x(ind)
ans =
0 0 1
0 1 0
1 0 1
0 1 1
1 1 1
1 1 0
1 0 0
0 0 0
Each row is an overlapping neighbourhood of n elements. We finally end by searching for the first row that gives us all 1s. This is done by using all and searching over every row independently with the 2 as the second parameter. all produces true if every element in a row is non-zero, or 1 in our case. We then combine with find to determine the first non-zero location that satisfies this constraint... and so:
>> out = find(all(x(ind), 2), 1)
out =
5
This tells us that the fifth location of x is where the beginning of this duplication occurs n times.
Based on Rayryeng's approach you can loop this as well. This will definitely be slower for short array sizes, but for very large array sizes this doesn't calculate every possibility, but stops as soon as the first match is found and thus will be faster. You could even use an if statement based on the initial array length to choose whether to use the bsxfun or the for loop. Note also that for loops are rather fast since the latest MATLAB engine update.
x = [0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
for idx = 1:numel(x)-n
if all(x(idx:idx+n-1))
break
end
end
Additionally, this can be used to find the a first occurrences:
x = [0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
a = 2; %// number of desired matches
collect(1,a)=0; %// initialise output
kk = 1; %// initialise counter
for idx = 1:numel(x)-n
if all(x(idx:idx+n-1))
collect(kk) = idx;
if kk == a
break
end
kk = kk+1;
end
end
Which does the same but shuts down after a matches have been found. Again, this approach is only useful if your array is large.
Seeing you commented whether you can find the last occurrence: yes. Same trick as before, just run the loop backwards:
for idx = numel(x)-n:-1:1
if all(x(idx:idx+n-1))
break
end
end
One possibility with looping:
i = 0;
n = 3;
for idx = n : length(x)
idx_true = 1;
for sub_idx = (idx - n + 1) : idx
idx_true = idx_true & (x(sub_idx));
end
if(idx_true)
i = idx - n + 1;
break
end
end
if (i == 0)
disp('No index found.')
else
disp(i)
end

Change Random Element of a Matrix with small Restriction

I have a Vector 1xm, ShortMemory (SM), and a Matrix nxm, Agenda (AG).
SM only have positive integers and zeros.
Each column of AG only has one element equal to 1 and all the other elements of the same column equal to 0.
My objective is changing the position of the number 1 from a randomly choosen column from AG. The problem is that only columns that have a corresponding 0 in SM can be changed.
Example:
SM = [1 0 2];
AG = [1 1 0 ; 0 0 1 ; 0 0 0];
Randomly Generated number here
RandomColumn = 2;
The possible outcomes would be
AG = [1 0 0 ; 0 1 1 ; 0 0 0]; or AG = [1 0 0; 0 0 1 ; 0 1 0]; or AG = [1 1 0 ; 0 0 1 ; 0 0 0];
The Line that gets the 1 is also random but that's easy to do
I could do it by just getting random numbers between 1 and m but m can be very big in my problem and the number of zeros can be very small too, so it could potentially take alot of time. I could also do it with a cycle but it's Matlab and this is embeded on double cycle already.
Thanks
edit: Added commentary to the code for clarity.
edit: Corrected an error on possible outcome
My solution is based on following assumptions:
Objective is changing the position of the number 1 from a randomly choosen column from AG.
Only columns that have a corresponding 0 in SM can be changed.
Solution:
% input
SM = [1 0 2]
AG = [1 1 0 ; 0 0 1 ; 0 0 0]
% generating random column according to assumptions 1 and 2
RandomColumn1 = 1:size(AG,2);
RandomColumn1(SM~=0)=[];
RandomColumn1=RandomColumn1(randperm(length(RandomColumn1)));
RandomColumn=RandomColumn1(1);
% storing the current randomly chosen column before changing
tempColumn=AG(:,RandomColumn);
% shuffling the position of 1
AG(:,RandomColumn)=AG(randperm(size(AG,1)),RandomColumn);
% following checks if the column has remained same after shuffling. This while loop should execute (extremely) rarely.
while tempColumn==AG(:,RandomColumn)
tempColumn=AG(:,RandomColumn);
AG(:,RandomColumn)=AG(randperm(size(AG,1)),RandomColumn);
end
AG

A question about matrix manipulation

Given a 1*N matrix or an array, how do I find the first 4 elements which have the same value and then store the index for those elements?
PS:
I'm just curious. What if we want to find the first 4 elements whose value differences are within a certain range, say below 2? For example, M=[10,15,14.5,9,15.1,8.5,15.5,9.5], the elements I'm looking for will be 15,14.5,15.1,15.5 and the indices will be 2,3,5,7.
If you want the first value present 4 times in the array 'tab' in Matlab, you can use
num_min = 4
val=NaN;
for i = tab
if sum(tab==i) >= num_min
val = i;
break
end
end
ind = find(tab==val, num_min);
By instance with
tab = [2 4 4 5 4 6 4 5 5 4 6 9 5 5]
you get
val =
4
ind =
2 3 5 7
Here is my MATLAB solution:
array = randi(5, [1 10]); %# random array of integers
n = unique(array)'; %'# unique elements
[r,~] = find(cumsum(bsxfun(#eq,array,n),2) == 4, 1, 'first');
if isempty(r)
val = []; ind = []; %# no answer
else
val = n(r); %# the value found
ind = find(array == val, 4); %# indices of elements corresponding to val
end
Example:
array =
1 5 3 3 1 5 4 2 3 3
val =
3
ind =
3 4 9 10
Explanation:
First of all, we extract the list of unique elements. In the example used above, we have:
n =
1
2
3
4
5
Then using the BSXFUN function, we compare each unique value against the entire vector array we have. This is equivalent to the following:
result = zeros(length(n),length(array));
for i=1:length(n)
result(i,:) = (array == n(i)); %# row-by-row
end
Continuing with the same example we get:
result =
1 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 1 1 0 0 0 0 1 1
0 0 0 0 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 0
Next we call CUMSUM on the result matrix to compute the cumulative sum along the rows. Each row will give us how many times the element in question appeared so far:
>> cumsum(result,2)
ans =
1 1 1 1 2 2 2 2 2 2
0 0 0 0 0 0 0 1 1 1
0 0 1 2 2 2 2 2 3 4
0 0 0 0 0 0 1 1 1 1
0 1 1 1 1 2 2 2 2 2
Then we compare that against four cumsum(result,2)==4 (since we want the location where an element appeared for the forth time):
>> cumsum(result,2)==4
ans =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Finally we call FIND to look for the first appearing 1 according to a column-wise order: if we traverse the matrix from the previous step column-by-column, then the row of the first appearing 1 indicates the index of the element we are looking for. In this case, it was the third row (r=3), thus the third element in the unique vector is the answer val = n(r). Note that if we had multiple elements repeated 4 times or more in the original array, then the one first appearing for the forth time will show up first as a 1 going column-by-column in the above expression.
Finding the indices of the corresponding answer value is a simple call to FIND...
Here is C++ code
std::map<int,std::vector<int> > dict;
std::vector<int> ans(4);//here we will store indexes
bool noanswer=true;
//my_vector is a vector, which we must analize
for(int i=0;i<my_vector.size();++i)
{
std::vector<int> &temp = dict[my_vector[i]];
temp.push_back(i);
if(temp.size()==4)//we find ans
{
std::copy(temp.begin(),temp.end(),ans.begin() );
noanswer = false;
break;
}
}
if(noanswer)
std::cout<<"No Answer!"<<std::endl;
Ignore this and use Amro's mighty solution . . .
Here is how I'd do it in Matlab. The matrix can be any size and contain any range of values and this should work. This solution will automatically find a value and then the indicies of the first 4 elements without being fed the search value a priori.
tab = [2 5 4 5 4 6 4 5 5 4 6 9 5 5]
%this is a loop to find the indicies of groups of 4 identical elements
tot = zeros(size(tab));
for nn = 1:numel(tab)
idxs=find(tab == tab(nn), 4, 'first');
if numel(idxs)<4
tot(nn) = Inf;
else
tot(nn) = sum(idxs);
end
end
%find the first 4 identical
bestTot = find(tot == min(tot), 1, 'first' );
%store the indicies you are interested in.
indiciesOfInterst = find(tab == tab(bestTot), 4, 'first')
Since I couldn't easily understand some of the solutions, I made that one:
l = 10; m = 5; array = randi(m, [1 l])
A = zeros(l,m); % m is the maximum value (may) in array
A(sub2ind([l,m],1:l,array)) = 1;
s = sum(A,1);
b = find(s(array) == 4,1);
% now in b is the index of the first element
if (~isempty(b))
find(array == array(b))
else
disp('nothing found');
end
I find this easier to visualize. It fills '1' in all places of a square matrix, where values in array exist - according to their position (row) and value (column). This is than summed up easily and mapped to the original array. Drawback: if array contains very large values, A may get relative large too.
You're PS question is more complicated. I didn't have time to check each case but the idea is here :
M=[10,15,14.5,9,15.1,8.5,15.5,9.5]
val = NaN;
num_min = 4;
delta = 2;
[Ms, iMs] = sort(M);
dMs = diff(Ms);
ind_min=Inf;
n = 0;
for i = 1:length(dMs)
if dMs(i) <= delta
n=n+1;
else
n=0;
end
if n == (num_min-1)
if (iMs(i) < ind_min)
ind_min = iMs(i);
end
end
end
ind = sort(iMs(ind_min + (0:num_min-1)))
val = M(ind)

Resources