How to delete rows from a matrix that contain more than 50% zeros in MATLAB

I want to remove the rows in an array that contain more than 50% of null elements.
eg:
if the input is
1 0 0 0 5 0
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
0 0 4 0 1 0
I want to remove rows 1 and 5, but retain the rest. The output should look like:
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
I want to do this using MATLAB.

Use logical indexing into the rows, based on the row-wise mean of the logical array A==0 (that is, the fraction of zeros in each row):
t = .5; % threshold
A(mean(A==0,2) > t, :) = [];
What this does:
Compare A with 0: turns zeros into true, and nonzeros into false.
Compute the mean of each row.
Compare that to the desired threshold.
Use the result as a logical index to delete unwanted rows.
Equivalently, you can keep the wanted rows instead of removing the unwanted ones. This may be faster, depending on the proportion of rows that are removed:
A = A(mean(A~=0,2) >= 1-t, :);
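As a quick check, here is a minimal sketch that applies the threshold approach to the example matrix from the question. The names zeroFrac and B are illustrative choices, not part of the original answer.

```matlab
% Example matrix from the question
A = [1 0 0 0 5 0
     2 3 5 4 3 1
     3 0 0 4 3 0
     2 0 9 8 2 1
     0 0 4 0 1 0];
t = 0.5;                      % threshold: maximum allowed fraction of zeros

zeroFrac = mean(A == 0, 2);   % fraction of zeros in each row (column vector)
B = A(zeroFrac <= t, :);      % keep rows with at most 50% zeros
disp(B)
```

Row 3 contains exactly 50% zeros, so it is kept: only rows with strictly more than 50% zeros (rows 1 and 5) are dropped.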

You can also use the standardizeMissing function and rmmissing function together to achieve this:
>> [~,tf] = rmmissing(standardizeMissing(A,0),'MinNumMissing',floor(0.5*size(A,2))+1);
>> A(~tf,:)
The call to standardizeMissing replaces the 0 values with NaN (the standard missing indicator for double). The rmmissing call then identifies, in the logical vector tf, the rows in which more than 50% of the entries are 0 (i.e., the rows that have at least floor(0.5*size(A,2))+1 zero-valued entries). You can then negate the tf output and use it as a row index. You can easily adapt the minimum number of missing entries to satisfy whatever percentage criterion you want.
Also note that tf is a logical vector here that is only the size of the number of rows of A.
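To illustrate adapting the percentage, here is a small sketch with the fraction pulled out into a variable. The names p, k and B are hypothetical. It assumes a MATLAB release that provides rmmissing and standardizeMissing, and it assumes A has no NaN entries of its own, since those would also be counted as missing.

```matlab
A = [1 0 0 0 5 0
     2 3 5 4 3 1
     3 0 0 4 3 0
     2 0 9 8 2 1
     0 0 4 0 1 0];
p = 0.5;                          % remove rows with MORE than this fraction of zeros
k = floor(p * size(A, 2)) + 1;    % smallest zero count that exceeds the fraction

% tf(i) is true when row i has at least k zeros (treated as missing values)
[~, tf] = rmmissing(standardizeMissing(A, 0), 'MinNumMissing', k);
B = A(~tf, :);                    % keep the remaining rows
```

With six columns and p = 0.5, k is 4, so only rows 1 and 5 (four zeros each) are flagged.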

As I mentioned on Luis' answer, one downside to his approach is that it requires an intermediate logical array of the same size as A to be created, which can potentially incur a significant memory/performance penalty when working with large arrays.
An explicit looped approach with nnz (overly verbose, for clarity):
[nrows, ncols] = size(A);
maximum_ratio_of_zeros = 0.5;
minimum_ratio_of_nonzeros = 1 - maximum_ratio_of_zeros;
todelete = false(nrows, 1);
for ii = 1:nrows
    if nnz(A(ii,:))/ncols < minimum_ratio_of_nonzeros
        todelete(ii) = true;
    end
end
A(todelete,:) = [];
Which returns the desired answer.

Related

Efficient algorithm for merging consistent rows of multiple 2d arrays

I have an arbitrary number of 2d arrays of equal width but possibly non-equal height. They can consist of 1s, 0s, or a wildcard * which can match either a 1 or a 0. The wildcards are always in the same columns. I want to return all possible rows that are consistent with at least one row in every array simultaneously, and contain no wild cards.
For a concrete example, consider the three 2d arrays
    1 0 1 *         * 0 1 0         * 1 * 1
a = 1 1 1 *     b = * 1 1 0     c = * 0 * 0
    0 1 0 *         * 0 0 1
A possible row in the solution might be 1 0 1 0. It is consistent with the top row of a, the top row of b, and the second row of c. By contrast, a row that should not be in the solution is 0 1 0 1, since it is not consistent with any row of b despite being consistent with the bottom row of a and the top row of c.
Beyond doing an inefficient brute-force check, I'm rather stuck. It seems like there should be a faster way. Are there any tricks that might help solve this problem efficiently?
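For reference, the brute-force baseline mentioned above could look roughly like the sketch below. It is only meant to pin down what "consistent" means, not to be the efficient method being asked for. Encoding the wildcard as NaN is an assumption, as are all variable names, and the example arrays are transcribed under the reading that b has three rows and c has two.

```matlab
% Wildcards (*) are encoded as NaN here; this encoding is an assumption
a = [1 0 1 NaN; 1 1 1 NaN; 0 1 0 NaN];
b = [NaN 0 1 0; NaN 1 1 0; NaN 0 0 1];
c = [NaN 1 NaN 1; NaN 0 NaN 0];
arrays = {a, b, c};

w = size(a, 2);
candidates = dec2bin(0:2^w - 1) - '0';     % all 2^w wildcard-free rows
n = size(candidates, 1);
keep = true(n, 1);

for k = 1:numel(arrays)
    X = arrays{k};
    matchesAny = false(n, 1);
    for i = 1:size(X, 1)
        pattern = repmat(X(i, :), n, 1);
        % a candidate matches a row if every non-wildcard entry agrees
        matchesAny = matchesAny | all(isnan(pattern) | candidates == pattern, 2);
    end
    keep = keep & matchesAny;              % must match some row in EVERY array
end
result = candidates(keep, :);
```

The cost grows as 2^w times the total number of rows, which is exactly why a faster method is wanted.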

Matlab One Hot Encoding - convert column with categoricals into several columns of logicals

CONTEXT
I have a large number of columns with categoricals, all with different, unrankable choices. To make my life easier for analysis, I'd like to take each of them and convert it to several columns with logicals. For example:
GENRE
Pop
Classical
Jazz
...would turn into...
Pop Classical Jazz
1 0 0
0 1 0
0 0 1
PROBLEM
I've tried using ind2vec but this only works with numericals or logicals. I've also come across this but am not sure it works with categoricals. What is the right function to use in this case?
If you want to convert from a categorical vector to a logical array, you can use the unique function to generate column indices, then perform your encoding using any of the options from this related question:
% Sample data:
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'});
% Get unique categories and create indices:
[genre, ~, index] = unique(data)
genre =
Classical
Jazz
Pop
index =
3
1
2
3
3
2
% Create logical matrix:
mat = logical(accumarray([(1:numel(index)).' index], 1))
mat =
6×3 logical array
0 0 1
1 0 0
0 1 0
0 0 1
0 0 1
0 1 0
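As one possible variation, the same encoding can be built by comparing the indices against the category numbers and then labelled with the category names. This is a sketch, not the answer's own method; the names names and T are illustrative.

```matlab
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'});
[genre, ~, index] = unique(data);

% Compare each index against 1..K to get an N-by-K logical matrix
mat = bsxfun(@eq, index, 1:numel(genre));

% Label the columns with the category names
names = cellstr(genre);
T = array2table(mat, 'VariableNames', names');
```

Each row of mat has exactly one true entry, in the column of that row's category.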
ind2vec can still be used if you first convert the categorical data to a cell array of strings with the cellstr function and map those strings to numeric indices.
This code may help (adapted from this, with only small changes):
data = categorical({'Pop'; 'Classical'; 'Jazz';});
GENRE = cellstr(data); %change categorical data into cell strings
[~, loc] = ismember(GENRE, unique(GENRE));
genre = ind2vec(loc')';
Gen=full(genre);
array2table(Gen, 'VariableNames', unique(GENRE))
Running this code returns:
ans =
Classical Jazz Pop
_________ ____ ___
0 0 1
1 0 0
0 1 0
You can call unique(GENRE) to check the categories (as cell strings). Meanwhile, logical(Gen) (or logical(full(genre))) contains the logical columns that you need.
P.S. A categorical array might be faster than a cell array of strings, but the ind2vec function doesn't work with it. unique and accumarray might be a better choice.

MATLAB removing rows which has duplicates in sequence

I'm trying to remove the rows that contain duplicates in sequence. There are only 2 possible values, 0 and 1. My matrix is nXm, where n is the possible number of bits and m is not important for my question. My goal is to find a matrix that is nX(m-a), where a is the number of rows that contain duplicates in sequence. For example:
My matrix is :
A=[0 1 0 1 0 1;
0 0 0 1 1 1;
0 0 1 0 0 1;
0 1 0 0 1 0;
1 0 0 0 1 0]
I want to remove the rows that have t duplicates in sequence for 0. In this question, let's assume t is 3. So I want this matrix:
B=[0 1 0 1 0 1;
0 0 1 0 0 1;
0 1 0 0 1 0]
2nd and 5th rows are removed.
I probably need to use diff.
So you want to remove rows of A that contain at least t zeros in sequence.
How about a single line?
B = A(~any(conv2(1,ones(1,t),2*A-1,'valid')==-t, 2),:);
How this works:
Transform A to bipolar form (2*A-1)
Convolve each row with a sequence of t ones (conv2(...))
Keep only rows for which the convolution does not contain -t (~any(...)). The presence of -t indicates a sequence of t zeros in the corresponding row of A.
To remove rows that contain at least t ones, just change -t to t:
B = A(~any(conv2(1,ones(1,t),2*A-1,'valid')==t, 2),:);
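To make the intermediate steps visible, here is the same idea split into named stages and run on the matrix from the question. The names bipolar, windowSums and hasZeroRun are illustrative only.

```matlab
A = [0 1 0 1 0 1
     0 0 0 1 1 1
     0 0 1 0 0 1
     0 1 0 0 1 0
     1 0 0 0 1 0];
t = 3;

bipolar = 2*A - 1;                                    % 0 -> -1, 1 -> +1
% Sum of every t consecutive entries along each row (5-by-4 here)
windowSums = conv2(1, ones(1, t), bipolar, 'valid');
hasZeroRun = any(windowSums == -t, 2);                % -t means t zeros in sequence
B = A(~hasZeroRun, :);
```

For row 2 the window sums are -3, -1, 1, 3, so the first window flags it. For row 5 they are -1, -3, -1, -1.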
Here is a generalized approach that removes any row containing a given number of consecutive duplicates (not just zeros; it could be any number). Note that im2col requires the Image Processing Toolbox.
t = 3;
row_mask = ~any(all(~diff(reshape(im2col(A,[1 t],'sliding'),t,size(A,1),[]))),3);
out = A(row_mask,:)
Sample Run:
>> A
A =
0 1 0 1 0 1
0 0 1 5 5 5 %// consecutive 3 5's
0 0 1 0 0 1
0 1 0 0 1 0
1 1 1 0 0 1 %// consecutive 3 1's
>> out
out =
0 1 0 1 0 1
0 0 1 0 0 1
0 1 0 0 1 0
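If im2col is not available, a similar result can be sketched with diff and conv2 alone. This is a swapped-in variation, not the answer's code: it marks where an entry equals its right-hand neighbour and then looks for t-1 such matches in sequence. It assumes t >= 2 and that A has at least t columns.

```matlab
A = [0 1 0 1 0 1
     0 0 1 5 5 5
     0 0 1 0 0 1
     0 1 0 0 1 0
     1 1 1 0 0 1];
t = 3;   % run length to detect; this sketch assumes t >= 2

sameAsNext = double(diff(A, 1, 2) == 0);             % 1 where an entry equals its right neighbour
runSums = conv2(sameAsNext, ones(1, t-1), 'valid');  % matches in each window of t-1 neighbours
row_mask = ~any(runSums == t-1, 2);                  % t-1 matches in sequence = run of t equal values
out = A(row_mask, :);
```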
How about an approach using strings? This is certainly not as fast as Luis Mendo's method where you work directly with the numerical array, but it's thinking a bit outside of the box. The basis of this approach is that I consider each row of A to be a unique string, and I can search each string for occurrences of a string of 0s by regular expressions.
A=[0 1 0 1 0 1;
0 0 0 1 1 1;
0 0 1 0 0 1;
0 1 0 0 1 0;
1 0 0 0 1 0];
t = 3;
B = sprintfc('%s', char('0' + A));
ind = cellfun('isempty', regexp(B, repmat('0', [1 t])));
B(~ind) = [];
B = double(char(B) - '0');
We get:
B =
0 1 0 1 0 1
0 0 1 0 0 1
0 1 0 0 1 0
Explanation
Line 1: Convert each line of the matrix A into a string consisting of 0s and 1s. Each line becomes a cell in a cell array. This uses the undocumented function sprintfc to facilitate this cell array conversion.
Line 2: I use regular expressions to find any occurrences of a string of 0s that is t long. I first use repmat to create a search string that is full of 0s and is t long. After, I determine if each line in this cell array contains this sequence of characters (i.e. 000....). The function regexp helps us perform regular expressions and returns the locations of any matches for each cell in the cell array. Alternatively, you can use the function strfind for more recent versions of MATLAB to speed up the computation, but I chose regexp so that the solution is compatible with most MATLAB distributions out there.
Continuing on, the output of regexp/strfind is a cell array in which each cell reports the locations where the particular string was found. If there is a match, at least one location is reported, so I check which cells are empty: those are the rows we don't want to remove. I want a logical array for the purpose of removing rows, so this is wrapped in a cellfun call that determines which cells are empty. Therefore, this line returns a logical array where a 0 means remove this row and a 1 means keep it.
Line 3: I take the logical array from Line 2 and invert it because that's what we really want. We use this inverted array to index into the cell array and remove those strings.
Line 4: The output is still a cell array, so I convert it back into a character array, and finally back into a numerical array.
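For readers who would rather avoid the undocumented sprintfc, the same string idea can be sketched with documented functions only. This is a variation under that assumption, not the code from the answer; S, hits and keep are illustrative names.

```matlab
A = [0 1 0 1 0 1
     0 0 0 1 1 1
     0 0 1 0 0 1
     0 1 0 0 1 0
     1 0 0 0 1 0];
t = 3;

S = cellstr(char('0' + A));               % one '0'/'1' string per row of A
hits = strfind(S, repmat('0', 1, t));     % start positions of t zeros in each string
keep = cellfun('isempty', hits);          % true where no run of t zeros was found
B = A(keep, :);                           % index the numeric matrix directly
```

Indexing A with keep skips the conversion back from characters to numbers.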

graph representing the randomization of each column in a binary matrix

Imagine the following binary image exemplified by the matrix below. This is a simplified version of the images I'll be working with:
0 1 0 1
0 1 1 1
0 0 0 1
0 1 1 1
I want to construct a graph that will represent the randomness of each column. My thought is to develop a random index, defined as the total number of transitions between values in the column divided by the total possible number of transitions. In the matrix above, each column has a total of 3 possible transitions.
For the example above:
Column 1 would have a random index of 0% (0/3)
Column 2 would have a random index of 66.7% (2/3)
Column 3 = 100% (3/3)
Column 4 = 0% (0/3) even though they are 1's and not 0's. Doesn't matter, I just want the transitions.
Can I draw a boundary around all the 1 values and then have MATLAB sum all of the boundaries?
To calculate what you are suggesting you can just do:
sum( diff(A) ~= 0 )
The diff(A) will take the forward difference down the columns and the sum will count the number of non-zero changes. So if you do this you will get:
ans =
0 2 3 0
Let your image be defined as
im = [ 0 1 0 1
0 1 1 1
0 0 0 1
0 1 1 1 ];
The random index you want can be computed as
result = sum(diff(im)~=0) / (size(im,1)-1);
Explanation: diff computes the difference between consecutive elements down each column. The result is compared against zero (~=0), and all nonzero values within each column are added (with sum). Finally, the result is divided by the maximum number of transitions, which is the number of rows minus 1 (size(im,1)-1).
Equivalently, you could use xor between consecutive rows:
result = sum(xor(im(1:end-1,:), im(2:end,:))) / (size(im,1)-1)
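Putting the pieces together for the example image, a minimal sketch might look like this. The names transitions, maxTransitions and randomIndex are illustrative, and the index is expressed in percent to match the question.

```matlab
im = [0 1 0 1
      0 1 1 1
      0 0 0 1
      0 1 1 1];

transitions = sum(diff(im) ~= 0, 1);      % value changes down each column
maxTransitions = size(im, 1) - 1;         % a column of N entries has N-1 possible changes
randomIndex = 100 * transitions / maxTransitions;   % percent, one value per column
```

Passing randomIndex to bar would then give one bar per column, which is one way to draw the graph asked about in the question.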

Look at each row separately in a matrix (Matlab)

I have a matrix in Matlab(2012) with 3 columns and X number of rows, X is defined by the user, so varies each time. For this example though I will use a fixed 5x3 matrix.
So I would like to perform an iterative function on each row within the matrix, while the value in the third column is below a certain value. Then store the new values within the same matrix, so overwrite the original values.
The code below is a simplified version of the problem.
M=[-2 -5 -3 -2 -4]; %Vector containing random values
Vf_X=M+1; %Defining the first column of the matrix
Vf_Y=M+2; %Defining the second column of the matrix
Vf_Z=M; %Defining the third column of the matrix
Vf=[Vf_X',Vf_Y',Vf_Z']; %Creating the matrix
while Vf(:,3)<0
    Vf=Vf+1;
end
disp(Vf)
The result I get is
1 2 0
-2 -1 -3
0 1 -1
1 2 0
-1 0 -2
Ideally I would like to get this result instead
1 2 0
1 2 0
1 2 0
1 2 0
1 2 0
The while loop will not start if any value in the third column is already non-negative, and it stops as soon as one value reaches zero.
I hope this makes sense and I have supplied enough information
Thank you for your time and help.
Your current problem is that you stop iterating the very moment any of the values in the third column breaks the condition. Correct me if I'm wrong, but what I think you want is to continue iterating on the remaining rows until the condition is broken by the third-column value of every row.
You could do that like this:
inds = Vf(:,3) < 0; % only rows that still satisfy the condition
while any(inds)
    Vf(inds,:) = Vf(inds,:)+1;
    inds = Vf(:,3) < 0;
end
Of course, for the simple addition you provide, there is a better and faster way:
inds = Vf(:,3)<0;
Vf(inds,:) = bsxfun(@minus, Vf(inds,:), Vf(inds,3));
But for general functions, the while above will do the trick.
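For a general update, the loop can be sketched with the per-row operation factored out into a function handle. The handle step below is hypothetical and simply reproduces the +1 from the question; the loop only terminates if the real update eventually pushes every third-column value to zero or above.

```matlab
M = [-2 -5 -3 -2 -4];
Vf = [M.'+1, M.'+2, M.'];        % same 5-by-3 matrix as in the question

step = @(rows) rows + 1;         % hypothetical update; replace with the real operation

inds = Vf(:,3) < 0;              % rows whose third column is still below zero
while any(inds)
    Vf(inds,:) = step(Vf(inds,:));   % update only the rows that still need it
    inds = Vf(:,3) < 0;
end
disp(Vf)
```

With the +1 update, every row ends up as 1 2 0, which is the result the question asks for.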
