In matlab, find the frequency at which unique rows appear in a matrix - arrays

In Matlab, say I have the following matrix, which represents a population of 10 individuals:
pop = [0 0 0 0 0; 1 1 1 0 0; 1 1 1 1 1; 1 1 1 0 0; 0 0 0 0 0; 0 0 0 0 0; 1 0 0 0 0; 1 1 1 1 1; 0 0 0 0 0; 0 0 0 0 0];
Where rows of ones and zeros define 6 different 'types' of individuals.
a = [0 0 0 0 0];
b = [1 0 0 0 0];
c = [1 1 0 0 0];
d = [1 1 1 0 0];
e = [1 1 1 1 0];
f = [1 1 1 1 1];
I want to define the proportion/frequency of a, b, c, d, e and f in pop.
I want to end up with the following list:
a = 0.5;
b = 0.1;
c = 0;
d = 0.2;
e = 0;
f = 0.2;
One way I can think of is by summing the rows, then counting the number of times each appears, and then sorting and indexing
sum_pop = sum(pop')';
x = unique(sum_pop);
N = numel(x);
count = zeros(N,1);
for l = 1:N
count(l) = sum(sum_pop==x(l));
end
pop_frequency = [x(:) count/10];
But this doesn't quite get me what I want (i.e. when frequency = 0) and it seems there must be a faster way?

You can use pdist2 (Statistics Toolbox) to get all frequencies:
indiv = [a;b;c;d;e;f]; %// matrix with all individuals
result = mean(pdist2(pop, indiv)==0, 1);
This gives, in your example,
result =
0.5000 0.1000 0 0.2000 0 0.2000
Equivalently, you can use bsxfun to manually compute pdist2(pop, indiv)==0, as in Divakar's answer.
For the specific individuals in your example (that can be identified by the number of ones) you could also do
result = histc(sum(pop, 2), 0:size(pop,2)) / size(pop,1);

There is some functionality in unique that can be used for this. If
[q,w,e] = unique(pop,'rows');
q is the matrix of unique rows, w is the index of the row first appears in the matrix. The third element e contains indices of q so that pop = q(e,:). Armed with this, the rest of the problem should be straight forward. The probability of a value in e should be the probability that this row appears in pop.
The counting can be done with histc
histc(e,1:max(e))/length(e)
and the non occuring rows can be found with
ismember(a,q,'rows')
There is of course other ways as well, maybe (probably) faster ways, or oneliners. Why I post this is because it provides a way that is easy to understand, readable and that does not require any special toolboxes.
EDIT
This example gives expected output
a = [0,0,0,0,0;1,0,0,0,0;1,1,0,0,0;1,1,1,0,0;1,1,1,1,0;1,1,1,1,1]; % catenated a-f
[q,w,e] = unique(pop,'rows');
prob = histc(e,1:max(e))/length(e);
out = zeros(size(a,1),1);
out(ismember(a,q,'rows')) = prob;

Approach #1
With bsxfun -
A = cat(1,a,b,c,d,e,f)
out = squeeze(sum(all(bsxfun(#eq,pop,permute(A,[3 2 1])),2),1))/size(pop,1)
Output -
out =
0.5000
0.1000
0
0.2000
0
0.2000
Approach #2
If those elements are binary numbers, you can convert them into decimal format.
Thus, decimal format for pop becomes -
>> bi2de(pop)
ans =
0
7
31
7
0
0
1
31
0
0
And that of the concatenated array, A becomes -
>> bi2de(A)
ans =
0
1
3
7
15
31
Finally, you need to count the decimal formatted numbers from A in that of pop, which you can do with histc. Here's the code -
A = cat(1,a,b,c,d,e,f)
out = histc(bi2de(pop),bi2de(A))/size(pop,1)
Output -
out =
0.5000
0.1000
0
0.2000
0
0.2000

I think ismember is the most direct and general way to do this. If your groups were more complicated, this would be the way to go:
population = [0,0,0,0,0; 1,1,1,0,0; 1,1,1,1,1; 1,1,1,0,0; 0,0,0,0,0; 0,0,0,0,0; 1,0,0,0,0; 1,1,1,1,1; 0,0,0,0,0; 0,0,0,0,0];
groups = [0,0,0,0,0; 1,0,0,0,0; 1,1,0,0,0; 1,1,1,0,0; 1,1,1,1,0; 1,1,1,1,1];
[~, whichGroup] = ismember(population, groups, 'rows');
freqOfGroup = accumarray(whichGroup, 1)/size(groups, 1);
In your special case the groups can be represented by their sums, so if this generic solution is not fast enough, use the sum-histc simplification Luis used.

Related

Creating Indicator matrix based on vector with group IDs

I have a vector of group IDs:
groups = [ 1 ; 1; 2; 2; 3];
which I want to use to create a matrix consisting of 1's in case the i-th and the j-th element are in the same group, and 0 otherwise. Currently I do this as follows:
n = size(groups, 1);
indMatrix = zeros(n,n);
for i = 1:n
for j = 1:n
indMatrix(i,j) = groups(i) == groups(j);
end
end
indMatrix
indMatrix =
1 1 0 0 0
1 1 0 0 0
0 0 1 1 0
0 0 1 1 0
0 0 0 0 1
Is there a better solution avoiding the nasty double for-loop? Thanks!
This can be done quite easily using implicit singleton expansion, for R2016b or later:
indMatrix = groups==groups.';
For MATLAB versions before R2016b you need bsxfun to achieve singleton expansion:
indMatrix = bsxfun(#eq, groups, groups.');

Find where condition is true n times consecutively

I have an array (say of 1s and 0s) and I want to find the index, i, for the first location where 1 appears n times in a row.
For example,
x = [0 0 1 0 1 1 1 0 0 0] ;
i = 5, for n = 3, as this is the first time '1' appears three times in a row.
Note: I want to find where 1 appears n times in a row so
i = find(x,n,'first');
is incorrect as this would give me the index of the first n 1s.
It is essentially a string search? eg findstr but with a vector.
You can do it with convolution as follows:
x = [0 0 1 0 1 1 1 0 0 0];
N = 3;
result = find(conv(x, ones(1,N), 'valid')==N, 1)
How it works
Convolve x with a vector of N ones and find the first time the result equals N. Convolution is computed with the 'valid' flag to avoid edge effects and thus obtain the correct value for the index.
Another answer that I have is to generate a buffer matrix where each row of this matrix is a neighbourhood of overlapping n elements of the array. Once you create this, index into your array and find the first row that has all 1s:
x = [0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
%// Solution
ind = bsxfun(#plus, (1:numel(x)-n+1).', 0:n-1); %'
out = find(all(x(ind),2), 1);
The first line is a bit tricky. We use bsxfun to generate a matrix of size m x n where m is the total number of overlapping neighbourhoods while n is the size of the window you are searching for. This generates a matrix where the first row is enumerated from 1 to n, the second row is enumerated from 2 to n+1, up until the very end which is from numel(x)-n+1 to numel(x). Given n = 3, we have:
>> ind
ind =
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
8 9 10
These are indices which we will use to index into our array x, and for your example it generates the following buffer matrix when we directly index into x:
>> x = [0 0 1 0 1 1 1 0 0 0];
>> x(ind)
ans =
0 0 1
0 1 0
1 0 1
0 1 1
1 1 1
1 1 0
1 0 0
0 0 0
Each row is an overlapping neighbourhood of n elements. We finally end by searching for the first row that gives us all 1s. This is done by using all and searching over every row independently with the 2 as the second parameter. all produces true if every element in a row is non-zero, or 1 in our case. We then combine with find to determine the first non-zero location that satisfies this constraint... and so:
>> out = find(all(x(ind), 2), 1)
out =
5
This tells us that the fifth location of x is where the beginning of this duplication occurs n times.
Based on Rayryeng's approach you can loop this as well. This will definitely be slower for short array sizes, but for very large array sizes this doesn't calculate every possibility, but stops as soon as the first match is found and thus will be faster. You could even use an if statement based on the initial array length to choose whether to use the bsxfun or the for loop. Note also that for loops are rather fast since the latest MATLAB engine update.
x = [0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
for idx = 1:numel(x)-n
if all(x(idx:idx+n-1))
break
end
end
Additionally, this can be used to find the a first occurrences:
x = [0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
a = 2; %// number of desired matches
collect(1,a)=0; %// initialise output
kk = 1; %// initialise counter
for idx = 1:numel(x)-n
if all(x(idx:idx+n-1))
collect(kk) = idx;
if kk == a
break
end
kk = kk+1;
end
end
Which does the same but shuts down after a matches have been found. Again, this approach is only useful if your array is large.
Seeing you commented whether you can find the last occurrence: yes. Same trick as before, just run the loop backwards:
for idx = numel(x)-n:-1:1
if all(x(idx:idx+n-1))
break
end
end
One possibility with looping:
i = 0;
n = 3;
for idx = n : length(x)
idx_true = 1;
for sub_idx = (idx - n + 1) : idx
idx_true = idx_true & (x(sub_idx));
end
if(idx_true)
i = idx - n + 1;
break
end
end
if (i == 0)
disp('No index found.')
else
disp(i)
end

Create a matrix with a diagonal and left-diagonal of all 1s in MATLAB

I would like to create a square matrix of size n x n where the diagonal elements as well as the left-diagonal are all equal to 1. The rest of the elements are equal to 0.
For example, this would be the expected result if the matrix was 5 x 5:
1 0 0 0 0
1 1 0 0 0
0 1 1 0 0
0 0 1 1 0
0 0 0 1 1
How could I do this in MATLAB?
Trivial using the tril function:
tril(ones(n),0) - tril(ones(n),-2)
And if you wanted a thicker line of 1s just adjust that -2:
n = 10;
m = 4;
tril(ones(n),0) - tril(ones(n),-m)
If you prefer to use diag like excaza suggested then try
diag(ones(n,1)) + diag(ones(n-1,1),-1)
but you can't control the 'thickness' of the stripe this way. However, for a thickness of 2, it might perform better. You'd have to test it though.
You can also use spdiags too to create that matrix:
n = 5;
v = ones(n,1);
d = full(spdiags([v v], [-1 0], n, n));
We get:
>> d
d =
1 0 0 0 0
1 1 0 0 0
0 1 1 0 0
0 0 1 1 0
0 0 0 1 1
The first two lines define the desired size of the matrix, assuming a square n x n as well as a vector of all ones that is of length n x 1. We then call spdiags to define where along the diagonal of this matrix this vector will be populating. We want to define the main diagonal to have all ones as well as the diagonal to the left of the main diagonal, or -1 away from the main diagonal. spdiags will adjust the total number of elements for the diagonal away from the main to compensate.
We also ensure that the output is of size n x n, but this matrix is actually sparse . We need to convert the matrix to full to complete the result.,
With a bit of indices juggling, you can also do this:
N = 5;
ind = repelem(1:N, 2); % [1 1 2 2 3 3 ... N N]
M = full(sparse(ind(2:end), ind(1:end-1), 1))
Simple approach using linear indexing:
n = 5;
M = eye(n);
M(2:n+1:end) = 1;
This can also be done with bsxfun:
n = 5; %// matrix size
d = [0 -1]; %// diagonals you want set to 1
M = double(ismember(bsxfun(#minus, 1:n, (1:n).'), d));
For example, to obtain a 5x5 matrix with the main diagonal and the two diagonals below set to 1, define n=5 and d = [0 -1 -2], which gives
M =
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1

Find number of consecutive ones in binary array

I want to find the lengths of all series of ones and zeros in a logical array in MATLAB. This is what I did:
A = logical([0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1]);
%// Find series of ones:
csA = cumsum(A);
csOnes = csA(diff([A 0]) == -1);
seriesOnes = [csOnes(1) diff(csOnes)];
%// Find series of zeros (same way, using ~A)
csNegA = sumsum(~A);
csZeros = csNegA(diff([~A 0]) == -1);
seriesZeros = [csZeros(1) diff(csZeros)];
This works, and gives seriesOnes = [4 2 5] and seriesZeros = [3 1 6]. However it is rather ugly in my opinion.
I want to know if there is a better way to do this. Performance is not an issue as this is inexpensive (A is no longer than a few thousand elements). I am looking for code clarity and elegance.
If nothing better can be done, I'll just put this in a little helper function so I don't have to look at it.
You could use an existing code for run-length-encoding, which does the (ugly) work for you and then filter out your vectors yourself. This way your helper function is rather general and its functionality is evident from the name runLengthEncode.
Reusing code from this answer:
function [lengths, values] = runLengthEncode(data)
startPos = find(diff([data(1)-1, data]));
lengths = diff([startPos, numel(data)+1]);
values = data(startPos);
You would then filter out your vectors using:
A = logical([0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1]);
[lengths, values] = runLengthEncode(A);
seriesOnes = lengths(values==1);
seriesZeros = lengths(values==0);
You can try this:
A = logical([0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1]);
B = [~A(1) A ~A(end)]; %// Add edges at start/end
edges_indexes = find(diff(B)); %// find edges
lengths = diff(edges_indexes); %// length between edges
%// Separate zeros and ones, to a cell array
s(1+A(1)) = {lengths(1:2:end)};
s(1+~A(1)) = {lengths(2:2:end)};
This strfind (works wonderfully with numeric arrays as well as string arrays) based approach could be easier to follow -
%// Find start and stop indices for ones and zeros with strfind by using
%// "opposite (0 for 1 and 1 for 0) sentients"
start_ones = strfind([0 A],[0 1]) %// 0 is the sentient here and so on
start_zeros = strfind([1 A],[1 0])
stop_ones = strfind([A 0],[1 0])
stop_zeros = strfind([A 1],[0 1])
%// Get lengths of islands of ones and zeros using those start-stop indices
length_ones = stop_ones - start_ones + 1
length_zeros = stop_zeros - start_zeros + 1

find specific zeros grouped in an array - matlab

I have an array like this one
a=[0 0 0 1 1 1 1 0 0];
and would like if the community would know an elegant way to find the last occurrence of 0 in the first group and the first occurrence of 0 in the last group of zeros. Please note there is always ones between the two groups of zeros. The answer should look like this
b=[0 0 1 0 0 0 0 1 0];
strfind based approach that works pretty well with numeric arrays to find patterns like these, seems like a good fit to solve it. Here's the implementation -
%// Find indices where we have matches of [0 1] and [1 0] corresponding to
%// the two cases as listed in the question
case1_idx = strfind(a,[0 1])
case2_idx = strfind(a,[1 0])
%// Initialize output array; set those required positions in it as ones
b = zeros(size(a))
b([case1_idx(1) case2_idx(end)+1]) = 1
Sample run -
a =
0 0 0 1 1 1 1 0 0
b =
0 0 1 0 0 0 0 1 0
How about this descriptive solution:
afterA = [a(2:end),nan]
beforeA = [nan,a(1:end-1)]
b = (a==0 & afterA==1) | (a==0 & beforeA==1)
d = diff(a)
res = zeros(size(a))
res(find(d==1)) = 1
res(find(d==-1)+1) = 1
or (assuming that a always starts and ends with a 0), you would not even need to search the entire array
res = zeros(size(a))
res(find(a, 1, 'first')-1) = 1
res(find(a, 1, 'last')+1) = 1
With convolution:
b = zeros(size(a)); %// initiallize
x = conv(2*a-1,[1 -1],'same'); %// convolution
b(find(x==2)) = 1; %// last zero in a run
b(find(x==-2)+1) = 1; %// first zero in a run
Or you could use the same approach with diff instead of conv:
b = zeros(size(a)); %// initiallize
c = diff(a); %// compute differences
b(find(c==1)) = 1; %// last zero in a run
b(find(c==-1)+1) = 1; %// first zero in a run

Resources