Matlab One Hot Encoding - convert column with categoricals into several columns of logicals - arrays

CONTEXT
I have a large number of columns with categoricals, all with different, unrankable choices. To make my life easier for analysis, I'd like to take each of them and convert it to several columns with logicals. For example:
1 GENRE
2 Pop
3 Classical
4 Jazz
...would turn into...
1 Pop Classical Jazz
2 1 0 0
3 0 1 0
4 0 0 1
PROBLEM
I've tried using ind2vec but this only works with numericals or logicals. I've also come across this but am not sure it works with categoricals. What is the right function to use in this case?

If you want to convert from a categorical vector to a logical array, you can use the unique function to generate column indices, then perform your encoding using any of the options from this related question:
% Sample data:
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'});
% Get unique categories and create indices:
[genre, ~, index] = unique(data)
genre =
Classical
Jazz
Pop
index =
3
1
2
3
3
2
% Create logical matrix:
mat = logical(accumarray([(1:numel(index)).' index], 1))
mat =
6×3 logical array
0 0 1
1 0 0
0 1 0
0 0 1
0 0 1
0 1 0

ind2vec do work with the cell strings, and you could call cellstr function to get such a cell string.
This codes may help (From this ,I only changed a little)
data = categorical({'Pop'; 'Classical'; 'Jazz';});
GENRE = cellstr(data); %change categorical data into cell strings
[~, loc] = ismember(GENRE, unique(GENRE));
genre = ind2vec(loc')';
Gen=full(genre);
array2table(Gen, 'VariableNames', unique(GENRE))
run such a code will return this:
ans =
Classical Jazz Pop
_________ ____ ___
0 0 1
1 0 0
0 1 0
you can call unique(GENRE) to check the categories(in cell strings). In the meanwhile, logical(Gen)(or call logical(full(genre))) contain columns with logical that you need.
P.s. categorical structure might be faster than cell string, but ind2vec function doesn't work with it. unique and accumarray might better.

Related

How to delete rows from a matrix that contain more than 50% zeros MATLAB

I want to remove the rows in an array that contain more than 50% of null elements.
eg:
if the input is
1 0 0 0 5 0
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
0 0 4 0 1 0
I want to remove rows 1 and 5, but retain the rest. The output should look like:
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
I want to do this using matlab
Use logical indexing into the rows, based on the mean of the rows of A negated:
t = .5; % threshold
A(mean(A==0,2) > t, :) = [];
What this does:
Compare A with 0: turns zeros into true, and nonzeros into false.
Compute the mean of each row.
Compare that to the desired threshold.
Use the result as a logical index to delete unwanted rows.
Equivalently, you can keep the wanted rows instead of removing the unwanted ones. This may be faster depending on the proportion of rows:
A = A(mean(A~=0,2) >= 1-t, :);
You can also use the standardizeMissing function and rmmissing function together to achieve this:
>> [~,tf] = rmmissing(standardizeMissing(A,0),'MinNumMissing',floor(0.5*size(A,2))+1);
>> A(~tf,:)
The call to standardizeMissing replaces the 0 values with NaN (the standard missing indicator for double), then the rmmissing call identifies in the logical vector tf the rows that have more than 50% of their entries as 0 (i.e., those rows that have more than floor(0.5*size(A,2))+1 0-valued entries. Then you can just negate the tf output and use it as an indexer. You can adapt the minimum number missing easily to satisfy whatever percentage criteria you want.
Also note that tf is a logical vector here that is only the size of the number of rows of A.
As I mentioned on Luis' answer, one downside to his approach is that it requires an intermediate logical array of the same size as A to be created, which can potentially incur a significant memory/performance penalty when working with large arrays.
An explicit looped approach with nnz (overly verbose, for clarity):
[nrows, ncols] = size(A);
maximum_ratio_of_zeros = 0.5;
minimum_ratio_of_nonzeros = 1 - maximum_ratio_of_zeros;
todelete = false(nrows, 1);
for ii = 1:nrows
if nnz(A(ii,:))/ncols < minimum_ratio_of_nonzeros
todelete(ii) = true;
end
end
A(todelete,:) = [];
Which returns the desired answer.

Calculating mean over an array of lists in R

I have an array built to accept the outputs of a modelling package:
M <- array(list(NULL), c(trials,3))
Where trials is a number that will generate circa 50 sets of data.
From a sampling loop, I am inserting a specific aspect of the outputs. The output from the modelling package looks a little like this:
Mt$effects
c_name effect Other
1 DPC_I 0.0818277549 0
2 DPR_I 0.0150814475 0
3 DPA_I 0.0405341027 0
4 DR_I 0.1255416311 0
5 (etc.)
And I am inserting it into my array via a loop
For(x in 1:trials) {
Mt<-run_model(params)
M[[x,3]] <- Mt$effects
}
The object now looks as follows
M[,3]
[[1]]
c_name effect Other
1 DPC_I 0.0818277549 0
2 DPR_I 0.0150814475 0
3 DPA_I 0.0405341027 0
4 DR_I 0.1255416311 0
5 (etc.)
[[2]]
c_name effect Other
1 DPC_I 0.0717384637 0
2 DPR_I 0.0190812375 0
3 DPA_I 0.0856456427 0
4 DR_I 0.2330002551 0
5 (etc.)
[[3]]
And so on (up to 50 elements).
What I want to do is calculate an average (and sd) of effect, grouped by each c_name, across each of these 50 trial runs, but I’m unable to extract the data in to a single dataframe (for example) so that I can run a ddply summarise across them.
I have tried various combinations of rbind, cbind, unlist, but I just can’t understand how to correctly lift this data out of the sequential elements. I note also that any reference to .names results in NULL.
Any solution would be most appreciated!

MATLAB removing rows which has duplicates in sequence

I'm trying to remove the rows which has duplicates in sequence. I have only 2 possible values which are 0 and 1. I have nXm which n shows possible number of bits and m is not important for my question. My goal is to find an matrix which is nX(m-a). The rows a which has the property which includes duplicates in sequence. For example:
My matrix is :
A=[0 1 0 1 0 1;
0 0 0 1 1 1;
0 0 1 0 0 1;
0 1 0 0 1 0;
1 0 0 0 1 0]
I want to remove the rows has t duplicates in sequence for 0. In this question let's assume t is 3. So I want the matrix which:
B=[0 1 0 1 0 1;
0 0 1 0 0 1;
0 1 0 0 1 0]
2nd and 5th rows are removed.
I probably need to use diff.
So you want to remove rows of A that contain at least t zeros in sequence.
How about a single line?
B = A(~any(conv2(1,ones(1,t),2*A-1,'valid')==-t, 2),:);
How this works:
Transform A to bipolar form (2*A-1)
Convolve each row with a sequence of t ones (conv2(...))
Keep only rows for which the convolution does not contain -t (~any(...)). The presence of -t indicates a sequence of t zeros in the corresponding row of A.
To remove rows that contain at least t ones, just change -t to t:
B = A(~any(conv2(1,ones(1,t),2*A-1,'valid')==t, 2),:);
Here is a generalized approach which removes any rows which has given number of consecutive duplicates (not just zero. could be any number).
t = 3;
row_mask = ~any(all(~diff(reshape(im2col(A,[1 t],'sliding'),t,size(A,1),[]))),3);
out = A(row_mask,:)
Sample Run:
>> A
A =
0 1 0 1 0 1
0 0 1 5 5 5 %// consecutive 3 5's
0 0 1 0 0 1
0 1 0 0 1 0
1 1 1 0 0 1 %// consecutive 3 1's
>> out
out =
0 1 0 1 0 1
0 0 1 0 0 1
0 1 0 0 1 0
How about an approach using strings? This is certainly not as fast as Luis Mendo's method where you work directly with the numerical array, but it's thinking a bit outside of the box. The basis of this approach is that I consider each row of A to be a unique string, and I can search each string for occurrences of a string of 0s by regular expressions.
A=[0 1 0 1 0 1;
0 0 0 1 1 1;
0 0 1 0 0 1;
0 1 0 0 1 0;
1 0 0 0 1 0];
t = 3;
B = sprintfc('%s', char('0' + A));
ind = cellfun('isempty', regexp(B, repmat('0', [1 t])));
B(~ind) = [];
B = double(char(B) - '0');
We get:
B =
0 1 0 1 0 1
0 0 1 0 0 1
0 1 0 0 1 0
Explanation
Line 1: Convert each line of the matrix A into a string consisting of 0s and 1s. Each line becomes a cell in a cell array. This uses the undocumented function sprintfc to facilitate this cell array conversion.
Line 2: I use regular expressions to find any occurrences of a string of 0s that is t long. I first use repmat to create a search string that is full of 0s and is t long. After, I determine if each line in this cell array contains this sequence of characters (i.e. 000....). The function regexp helps us perform regular expressions and returns the locations of any matches for each cell in the cell array. Alternatively, you can use the function strfind for more recent versions of MATLAB to speed up the computation, but I chose regexp so that the solution is compatible with most MATLAB distributions out there.
Continuing on, the output of regexp/strfind is a cell array of elements where each cell reports the locations of where we found the particular string. If we have a match, there should be at least one location that is reported at the output, so I check to see if any matches are empty, meaning that these are the rows we don't want to remove. I want to turn this into a logical array for the purposes of removing rows from A, and so this is wrapped with a cellfun call to determine the cells that are empty. Therefore, this line returns a logical array where a 0 means that remove this row and a 1 means that we don't.
Line 3: I take the logical array from Line 2 and invert it because that's what we really want. We use this inverted array to index into the cell array and remove those strings.
Line 4: The output is still a cell array, so I convert it back into a character array, and finally back into a numerical array.

graph representing the randomization of each column in a binary matrix

Imagine the following binary image exemplified by the matrix below. This is a simplified version of the images I'll be working with:
0 1 0 1
0 1 1 1
0 0 0 1
0 1 1 1
I want to construct a graph that will represent the randomness of each column. My thought is to develop a random index = the total transitions between each value in the column / by the total possible transitions. In the matrix above, each column could have a total possible of 3 transitions.
For the example above:
Column 1 would have a random index of 0% (0/3)
Column 2 would have a random index of 66.7% (2/3)
Column 3 = 100% (3/3)
Column 4 = 0% (0/3) even though they are 1's and not 0's. Doesn't matter, I just want the transitions.
Can I draw a boundary around all the 1 values and then have MATLAB sum all of the boundaries?
To calculate what you are suggesting you can just do:
sum( diff(A) ~= 0 )
The diff(A) will take the forward difference down the columns and the sum will count the number of non-zero changes. So if you do this you will get:
ans =
0 2 3 0
Let your image be defined as
im = [ 0 1 0 1
0 1 1 1
0 0 0 1
0 1 1 1 ];
The random index you want can be computed as
result = sum(diff(im)~=0) / (size(im,1)-1);
Explanation: diff computes the difference between consecutive elemtents down each column. The result is compared against zero (~=0), and all nonzero values within each row are added (with sum). Finally, the result is divided by the maximum number os transitions, which is the number of rows minus 1 (size(im,1)-1)
Equivalently, you could use xor between consecutive rows:
result = sum(xor(im(1:end-1,:), im(2:end,:))) / (size(im,1)-1)

an array of arrays varied in length in R

I use R for my statistical analysis.
I wanna group my data in an array based on the ID column. This results in having an array of unique IDs which each cell includes a data array of correspondence ID. Since the number of the data per ID is not similar, therefor each array in each cell has different length.
So I wonder how I can create an array of arrays varied in length using R?
I already having the following codes but get an error:
#number of unique IDs
size<-unique(data[,1]);
for (i in 1:length (gr))
{
index<- which(data[,1]==gr[i]);
data_c[[i,1]]<-data[index,];
}
Here is the error
more elements supplied than there are to replace
Thanks in advance for any comment.
I explain my problem by an example:
I have following data called it DATA_ALL:
DATA_ALL[]=
id age T1 T2 T3 T4
1 20 1 0 0 0
1 20 NA 0 NA 0
1 20 0 0 0 0
5 30 1 NA 0 0
5 30 0 0 0 1
6 40 0 1 0 0
I want to group the data of each id and put all in an array (array of arrays):
DATA_GROUPED []=
id data
1 1 X1[]=[an array includes all data from DATA_ALL where the id=1]
2 5 X2[]=[an array includes all data from DATA_ALL where the id=5]
3 6 X3[]=[an array includes all data from DATA_ALL where the id=6]
Please note that the length of X1!=X2!=X3
So how I can create the DATA_GROUPED[] matrix??
It is nearly impossible to answer your question in relation to your code, but in general, I think what you want to do is create a list of vectors, a bit like this:
one<-letters[1]
two<-letters[2:3]
three<-letters[4:6]
combined<-list(one=one, two=two, three=three)
Be sure to use indexing correctly now, and preferably with [[:
for(i in 1:length(combined))
{
cat("The contents of item", names(combined)[i], "are:", combined[[i]], "\n")
}
Output:
The contents of item one are: a
The contents of item two are: b c
The contents of item three are: d e f
Edit (following edit of question):
split.data.frame(DATA_ALL, DATA_ALL[,1])
Check ?split and note the first paragraph in Details.
Note this indeed creates a list of matrices/arrays.

Resources