Relate indices where two dataframes are equal with elements in another array - arrays

I have an array cluster_true and a dataframe data containing in each row a 2D coordinate. I want to save in another dataframe information regarding how many times for a given 2D coordinate each element in cluster_true appeared. So, for instance, for the coordinate (1,1), I want to check all the rows in data whose first two columns have the value of 1, and then check the values of cluster_true at those indices. Here is an example to make it clearer (it gives the desired result):
# Example variables
cluster_true = c(1,2,1,1,2,2,1,2,2,2,2,1,1)
x = 3
y = 3
data = data.frame(X = c(1,1,0,0,2,1,1,0,0,0,1,1,1),
Y = c(1,1,2,1,2,2,1,0,0,0,0,2,0))
# Names of the columns
plot_colnames = c('X', 'Y', paste('cluster',unique(cluster_true),sep='_'))
# Empty dataframe with the right column names
plot_df = data.frame(matrix(vector(), x*y, length(plot_colnames),
dimnames=list(c(), plot_colnames)),
stringsAsFactors=F)
# Each row belongs to a certain 2D coordinate
plot_df$X = rep(1:x, y)-1
plot_df$Y = rep(1:x, each = y)-1
# This is what I don't know how to improve
for(i in 1:nrow(plot_df)){
idx = which(apply(data[,1:2], 1, function(x) all(x == plot_df[i,1:2])))
plot_df[i,3] = sum(cluster_true[idx] == 1)
plot_df[i,4] = sum(cluster_true[idx] == 2)
}
print(plot_df)
Things I need to change and I don't know how to:
I think the loop could be avoided in order to get a more elegant solution, but I don't know how. The dataframe data could have a very large number of rows, so efficient code would be awesome.
Inside the loop, I've hardcoded the clusters to check (the last two lines inside the loop assume that I know which numbers are present in cluster_true and which column of plot_df each corresponds to). In fact, the elements in cluster_true could be anything, even non-consecutive numbers (e.g. cluster_true = c(1,5,5,5,56,10,19,10)).
So basically, I want to know if this could be done without the loop and as generic as possible.

If I understand correctly, the OP wants to
find the row indices for all unique combinations of X, Y coordinates in data,
look up the value in the corresponding rows of cluster_true,
count the number of occurrences of each value for the given X, Y combination, and
print the results in wide format.
This can be solved by joining and reshaping:
library(data.table) # version 1.11.4 used
library(magrittr) # use piping to improve readability
# unique coordinate pairs
uni_coords <- unique(setDT(data)[, .(X, Y)])[order(X, Y)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI] %>%
# reshape from long to wide format, thereby counting occurrences
dcast(X + Y ~ sprintf("cluster_%02i", V1), length)
X Y cluster_01 cluster_02
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 2 0 1
5: 3 1 1 0
6: 3 2 1 0
7: 3 3 0 3
This is identical with OP's expected result except for the coordinate combinations which do not appear in data.
setDT(plot_df)[order(X, Y)]
X Y cluster_1 cluster_2
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 1 0 0
5: 2 2 0 1
6: 2 3 0 0
7: 3 1 1 0
8: 3 2 1 0
9: 3 3 0 3
Reshaping has the benefit that it can handle arbitrary values in cluster_true as requested by the OP.
Edit
The OP has requested that all possible combinations of X, Y coordinates should be included in the final result. This can be achieved by using a cross join CJ() to compute uni_coords:
# all possible coordinate pairs
uni_coords <- setDT(data)[, CJ(X = X, Y = Y, unique = TRUE)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI][
uni_coords, on = .(X, Y)] %>%
# reshape from long to wide format, thereby counting occurrences
dcast(X + Y ~ sprintf("cluster_%02i", V1), length) %>%
# remove NA column from reshaped result
.[, cluster_NA := NULL] %>%
print()
X Y cluster_01 cluster_02
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 1 0 0
5: 2 2 0 1
6: 2 3 0 0
7: 3 1 1 0
8: 3 2 1 0
9: 3 3 0 3

Related

Automatic test for the equality of columns between two matrices

I have two matrices:
X =
1 2 3
4 5 6
7 8 9
Y =
1 10 11
4 12 13
7 14 15
I know that if I want to find the index of a specific element in X or Y, I can use the function find. For example:
index_3 = find(X==3)
What I want is an automatic way to check whether a column of X is also present in Y. In other words, I want a function that tells me whether a column in X is equal to a column in Y. One way to try this is the function ismember, which has an optional flag to compare rows:
rowsX = ismember(X, Y, 'rows');
So a simple way to get columns is just by taking the transpose of both matrices:
rowsX = ismember(X.', Y.', 'rows')
rowsX =
1
0
0
But how can I do that in another way?
Any help would be much appreciated!
You can do that with bsxfun and permute:
rowsX = any(all(bsxfun(@eq, X, permute(Y, [1 3 2])), 1), 3);
With
X = [ 1 2 3
4 5 6
7 8 9 ];
Y = [ 1 10 11
4 12 13
7 14 15 ];
this gives
rowsX =
1 0 0
How it works
permute "turns Y 90 degrees" along a vertical axis, so columns of Y are kept aligned with columns of X, but rows of Y are moved to the third dimension. Testing for equality with bsxfun and applying all(...,1) gives a matrix that tells which columns of X equal which columns of Y. Then any(...,3) produces the desired result: true if a column of X equals any column of Y.
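On MATLAB R2016b or later (an assumption about the reader's version), implicit expansion makes the bsxfun call optional. The following is only a sketch of the same idea, using the X and Y from above; the intermediate names Yp and eqCols are introduced here for illustration.

```matlab
X = [1 2 3; 4 5 6; 7 8 9];
Y = [1 10 11; 4 12 13; 7 14 15];
% Move the columns of Y to the third dimension: Yp is 3x1x3
Yp = permute(Y, [1 3 2]);
% X (3x3) and Yp (3x1x3) expand to 3x3x3; eqCols(1,i,j) is true
% when column i of X equals column j of Y
eqCols = all(X == Yp, 1);
% Collapse over the columns of Y: true if column i of X matches any column of Y
rowsX = any(eqCols, 3)
```

Splitting the expression into two steps also leaves eqCols available in case you need to know which column of Y matched, not just whether one did.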

merge two matrix and its attributes in matlab

I have two matrices a and b, and I'd like to combine them so that the first row of the result contains no duplicate values, and in the second row the values from columns of a and b that share the same first-row value are added together. For example:
a =
1 2 3
8 2 5
b =
1 2 5 7
2 4 6 1
Desired output c =
1 2 3 5 7
10 6 5 6 1
Any help is welcome, please.
For two-row matrices
You want to add second-row values corresponding to the same first-row value. This is a typical use of unique and accumarray:
[ii, ~, kk] = unique([a(1,:) b(1,:)]);
result = [ ii; accumarray(kk(:), [a(2,:) b(2,:)].').'];
General case
If you need to accumulate columns with an arbitrary number of rows (based on the first-row value), you can use sparse as follows:
[ii, ~, kk] = unique([a(1,:) b(1,:)]);
r = repmat((1:size(a,1)-1).', 1, numel(kk));
c = repmat(kk.', size(a,1)-1, 1);
result = [ii; full(sparse(r,c,[a(2:end,:) b(2:end,:)]))];
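As a quick check of the two-row case, here is a worked sketch on the a and b from the question. The variable names keys, vals and sums are introduced here for readability and are not part of the original answer.

```matlab
a = [1 2 3; 8 2 5];
b = [1 2 5 7; 2 4 6 1];
keys = [a(1,:) b(1,:)];   % first-row values of both matrices
vals = [a(2,:) b(2,:)];   % matching second-row values
% ii: sorted unique keys; kk: for each key, its index into ii
[ii, ~, kk] = unique(keys);
% Sum vals within each group; kk(:) and vals(:) force column vectors
sums = accumarray(kk(:), vals(:)).';
c = [ii; sums]
```

Forcing column vectors with (:) keeps the call independent of whether unique returns its index output as a row or a column, which has differed between MATLAB releases.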

Matlab: how to find an enclosing grid cell index for multiple points

I am trying to allocate (x, y) points to the cells of a non-uniform rectangular grid. Simply speaking, I have a grid defined as a sorted non-equidistant array
xGrid = [x1, x2, x3, x4];
and an array of numbers x lying between x1 and x4. For each element xi of x, I want to find its position in xGrid, i.e. the index i such that
xGrid(i) <= xi <= xGrid(i+1)
Is there a better (faster/simpler) way to do it than arrayfun(@(x) find(xGrid <= x, 1, 'last'), x)?
You are looking for the second output of histc:
[~,where] = histc(x, xGrid)
This returns the array where such that xGrid(where(i)) <= x(i) < xGrid(where(i)+1) holds.
Example:
xGrid = [2,4,6,8,10];
x = [3,5,6,9,11];
[~,where] = histc(x, xGrid)
Yields the following output:
where =
1 2 3 4 0
If you want xGrid(where(i)) < x(i) <= xGrid(where(i)+1), you need to do some trickery of negating the values:
[~,where] = histc(-x,-flip(xGrid));
where(where~=0) = numel(xGrid)-where(where~=0)
This yields:
where =
1 2 2 4 0
The difference arises because x(3)==6 is now counted in the second interval (4,6] instead of [6,8) as before.
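In newer MATLAB releases, discretize covers both edge conventions without the negation trick. This is a sketch that assumes discretize is available (R2015a or later); the names whereLeft and whereRight are introduced here.

```matlab
xGrid = [2,4,6,8,10];
x = [3,5,6,9,11];
% Left-closed bins [xGrid(i), xGrid(i+1)); only the last bin also includes its right edge
whereLeft = discretize(x, xGrid);
% Right-closed bins (xGrid(i), xGrid(i+1)]; only the first bin also includes its left edge
whereRight = discretize(x, xGrid, 'IncludedEdge', 'right');
% discretize marks out-of-range points with NaN; map them to 0 as histc does
whereLeft(isnan(whereLeft)) = 0
whereRight(isnan(whereRight)) = 0
```

One difference to keep in mind: a point exactly equal to xGrid(end) is placed in the last interval by discretize, whereas histc reports it in a separate final bin numel(xGrid).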
Using bsxfun for the comparisons and exploiting find-like capabilities of max's second output:
xGrid = [2 4 6 8]; %// example data
x = [3 6 5.5 10 -10]; %// example data
comp = bsxfun(@gt, xGrid(:), x(:).'); %'// see if "xGrid" > "x"
[~, result] = max(comp, [], 1); %// index of first "xGrid" that exceeds each "x"
result = result-1; %// subtract 1 to find the last "xGrid" that is <= "x"
This approach gives 0 for values of x that lie outside xGrid. With the above example values,
result =
1 3 2 0 0
See if this works for you -
matches = bsxfun(@le,xGrid(1:end-1),x(:)) & bsxfun(@ge,xGrid(2:end),x(:))
[valid,pos] = max(cumsum(matches,2),[],2)
pos = pos.*(valid~=0)
Sample run -
xGrid =
5 2 1 6 8 9 2 1 6
x =
3 7 14
pos =
8
4
0
Explanation on the sample run -
The first element of x, 3, satisfies xGrid(i) <= xi <= xGrid(i+1) for the pair 1, 6 twice; the last such pair sits at the end of xGrid, where the 1 is at the eighth position, so the first element of pos is 8. The second element, 7, lies between 6 and 8, and that 6 is at the fourth position in xGrid, so the second element of pos is 4. The third element, 14, has no neighbours satisfying xGrid(i) <= xi <= xGrid(i+1), so its output is 0.
If x is a column this might help
xg1=meshgrid(xGrid,1:length(x));
xg2=ndgrid(x,1:length(xGrid));
[~,I]=min(floor(abs(xg1-xg2)),[],2);
or a single line implementation
[~,I]=min(floor(abs(meshgrid(xGrid,1:length(x))-ndgrid(x,1:length(xGrid)))),[],2);
Example: xGrid=[1 2 3 4 5], x=[2.5; 1.3; 1.7; 4.8; 3.3]
Result:
I =
2
1
1
4
3

Vectorization- Matlab

Given a vector
X = [1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3]
I would like to generate a vector such
Y = [1 2 3 4 5 1 2 3 4 5 6 1 2 3 4 5]
So far what I have got is
idx = find(diff(X))
Y = [1:idx(1) 1:idx(2)-idx(1) 1:length(X)-idx(2)]
But I was wondering if there is a more elegant (robust) solution?
One approach with diff, find & cumsum for a generic case -
%// Initialize array of 1s with the same size as input array and an
%// intention of using cumsum on it after placing "appropriate" values
%// at "strategic" places for getting the final output.
out = ones(size(X))
%// Find starting indices of each "group", except the first group, and
%// by group here we mean run of identical numbers.
idx = find(diff(X))+1
%// Place differentiated and subtracted values of indices at starting locations
out(idx) = 1-diff([1 idx])
%// Perform cumulative summation for the final output
Y = cumsum(out)
Sample run -
X =
1 1 1 1 2 2 3 3 3 3 3 4 4 5
Y =
1 2 3 4 1 2 1 2 3 4 5 1 2 1
Just for fun, but customary bsxfun based alternative solution -
%// Logical mask with each column of ones for presence of each group elements
mask = bsxfun(@eq,X(:),unique(X(:).')) %//'
%// Cumulative summation along columns and use masked values for final output
vals = cumsum(mask,1)
Y = vals(mask)
Here's another approach:
Y = sum(triu(bsxfun(@eq, X, X.')), 1);
This works as follows:
Compare each element with all others (bsxfun(...)).
Keep only comparisons with current or previous elements (triu(...)).
Count, for each element, how many comparisons are true (sum(..., 1)); that is, how many elements, up to and including the current one, are equal to the current one.
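To make the three steps concrete, here is a small worked sketch. It assumes MATLAB R2016b or later, so that == expands implicitly and bsxfun can be dropped; the short example vector and the name eqMat are chosen here for illustration.

```matlab
X = [1 1 2 2 2 3];
% eqMat(i,j) is true when X(i) == X(j)  (6x6 logical matrix)
eqMat = (X.' == X);
% Keep only pairs with i <= j, then count the true entries in each column
Y = sum(triu(eqMat), 1)
```

The intermediate matrix has numel(X)^2 entries, so this form is best suited to short vectors; the cumsum approach above scales linearly instead.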
Another method uses the function unique
like this:
[unqX, ind] = unique(X, 'first')
Y = [1:ind(2)-ind(1) 1:ind(3)-ind(2) 1:length(X)-ind(3)+1]
Whether this is more elegant is up to you.
A more robust method is:
[unqX, ind] = unique(X, 'first');
ind = [ind(:).' numel(X)+1]; %// append an end marker so the last group is filled too
for ii = 1:numel(unqX)
Y(ind(ii):ind(ii+1)-1) = 1:(ind(ii+1)-ind(ii));
end

MATLAB: locate the first position of each unique number from a vector

I wish to locate the first position of each unique number in a vector, but without a for loop,
e.g.
a=[1 1 2 2 3 4 2 1 3 4];
and I can obtain the unique numbers with:
uniq=unique(a);
where uniq = [1 2 3 4]
What I want is to obtain each number's first appearance location. Any ideas?
first_pos = [1 3 5 6]
where 1 first appears at position 1 and 4 first appears at position 6 of the vector.
Also, what about the positions of the second appearances?
second_pos = [2 4 9 10]
Thank you very much
Use the second output of unique, and use the 'first' option:
>> A = [1 1 2 2 3 4 2 1 3 4];
>> [a,b] = unique(A, 'first')
a =
1 2 3 4 %// the unique values
b =
1 3 5 6 %// the first indices where these values occur
To find the locations of the second occurrences,
%// replace first occurrences with some random number
R = rand;
%// and do the same as before
A(b) = R;
[a2,b2] = unique(A, 'first');
%// Our random number is NOT part of original vector
b2(a2==R)=[];
a2(a2==R)=[];
with this:
b2 =
2 4 9 10
Note that there will have to be at least 2 occurrences of each number in the vector A if the sizes of b and b2 are to agree (this was not the case before your edit).
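If overwriting A is undesirable, one alternative sketch counts occurrences directly and then picks the k-th one. This is not the method above; it assumes MATLAB R2016b or later for implicit expansion, and the names k, occ, pos and kth_pos are introduced here for illustration.

```matlab
A = [1 1 2 2 3 4 2 1 3 4];
k = 2;   % which occurrence to locate
% occ(j) = number of elements of A(1:j) that are equal to A(j)
occ = sum(triu(A.' == A), 1);
% Positions that hold the k-th occurrence of their value
pos = find(occ == k);
% Order the positions by value so they line up with the sorted unique values
[vals, order] = sort(A(pos));
kth_pos = pos(order)
```

Values that occur fewer than k times simply do not appear in kth_pos; vals lists the values that do.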
