Complete rows across multiple arrays in MATLAB

I have two arrays and I need to count the number of rows that do not contain a NaN in any column of either array. This count is the effective sample size when an array of inputs is used to train against a vector of targets, because rows containing NaN are not used. Here is an example of my current solution:
% A matrix
A = [
-0.0057 14.8750 293.2000 2.3743 0 NaN -0.1186 NaN 38.1000
2.1543 10.2240 294.0200 1.7650 0 NaN 0.0962 NaN 30.4800
2.6071 7.1014 266.4000 1.3941 0 NaN -0.1110 23.6660 27.9400
0.9736 10.5730 271.2000 1.8700 0 NaN -0.2457 31.7290 27.9400
-0.7138 13.6430 286.3100 2.0655 0 NaN -0.5152 44.3640 27.9400
4.4969 5.5410 280.1600 0.6042 0 NaN -0.2783 47.9240 27.9400
5.4186 2.5648 251.6900 0.2323 0 NaN -0.0879 39.6710 25.4000
4.3641 3.4062 266.7800 0.5696 0 NaN -0.0638 26.9330 25.4000
-0.3348 8.2900 258.8900 1.3736 0 NaN -0.0414 59.2570 25.4000
0.3007 8.3617 274.7400 1.3929 0 NaN -0.3473 46.6710 25.4000
3.0400 4.6077 267.3400 0.9704 0 0.5178 -0.2080 32.4850 25.4000
2.1950 7.7303 253.8300 1.3545 0 0.4927 -0.0870 31.4520 25.4000
-0.4413 4.2283 275.7400 0.4724 0 0.3687 -0.2470 40.3630 27.9400
-0.8667 4.0397 261.0800 0.6118 0 0.4143 -0.4723 28.7360 27.9400
-8.0407 2.2782 158.9600 0.4654 0 0.1775 -0.9863 56.7880 30.4800
-15.4630 2.0072 230.4100 0.2572 0 0.0530 -2.2110 71.3660 35.5600
-14.7670 6.6983 293.4800 0.9218 0 0.1224 -4.3823 42.2330 38.1000
-8.5713 4.2573 249.6900 0.5928 0 0.2057 -4.6927 37.2790 38.1000
-13.4820 1.4811 120.2200 0.2327 0 0.0542 -4.1213 76.5140 38.1000
-15.6230 3.9040 300.8400 0.2369 0 0.0602 -3.4780 71.9860 NaN]
% And a vector of targets
B = [
NaN
NaN
1.2009
0.6404
0.5739
0.6846
0.4121
0.7475
0.5931
0.5706
0.8581
0.9910
NaN
0.5652
0.4008
NaN
0.4585
0.5463
0.2903
0.3150]
% Inputs
Alogic = isnan(A); % logical matrix of nans for drivers used
AlogicNaNSum = sum(Alogic,2); % sum by row
ANaNSumlogic = AlogicNaNSum >0; % logical by row with 0 for complete, 1 for some row containing NaNs
% Target
Blogic = isnan(B); % logical version of target with 0 for complete, 1 for containing NaN
SumNaNrows = Blogic + ANaNSumlogic; % add logical vectors, with 0 meaning no NaNs in any column
% Final number of rows with no NaN in any column
complete = sum(SumNaNrows(:)==0)
It seems like there should be a more elegant way to do this (fewer lines of code) that could still apply to vectors and/or matrices of the same length. There are many posts already about finding and replacing NaN rows like this and this, but I haven't found as much about counting the total number of complete rows across arrays.

You can do this using some basic logical operations. As you've shown, isnan creates a logical matrix the size of its input that is true wherever there is a NaN. We can then use any with a second argument of 2 (the dimension to operate along) to find the rows of A that contain at least one NaN. Finally, the element-wise or (|) combines this with isnan(B), giving a logical vector that is true if a row of A has a NaN or the corresponding element of B is NaN.
toremove = any(isnan(A), 2) | isnan(B);
Then if you simply want the number of rows that match:
complete = sum(~(any(isnan(A), 2) | isnan(B)));
You could also flip the logic around a little bit and check for rows that have no NaN values. The results will be the same:
tokeep = all(~isnan(A), 2) & ~isnan(B);
complete = sum(tokeep);
Yet another alternative is to append B as a new column of A and check the resulting matrix for rows that don't contain any NaN values:
tokeep = ~any(isnan([A, B]), 2);
complete = sum(tokeep);
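As a quick sanity check, here is a minimal sketch showing that the three formulations above give the same count. The small A and B below are made-up data for illustration (not the data from the question), and B is assumed to be a column vector with one element per row of A.

```matlab
% Hypothetical toy data: only rows 1 and 4 are complete
A = [1 2; NaN 3; 4 5; 6 7];
B = [1; 2; NaN; 3];

c1 = sum(~(any(isnan(A), 2) | isnan(B)));  % count rows not flagged for removal
c2 = sum(all(~isnan(A), 2) & ~isnan(B));   % count rows flagged to keep
c3 = sum(~any(isnan([A, B]), 2));          % concatenate first, then test rows

disp([c1 c2 c3])  % all three counts are 2
```

The concatenation variant needs A and B to have the same number of rows, and it builds a temporary copy of both arrays, which may matter for large inputs.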

Related

How to delete rows from a matrix that contain more than 50% zeros MATLAB

I want to remove the rows in an array that contain more than 50% of null elements.
eg:
if the input is
1 0 0 0 5 0
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
0 0 4 0 1 0
I want to remove rows 1 and 5, but retain the rest. The output should look like:
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
I want to do this using matlab
Use logical indexing on the rows, based on the row-wise mean of A compared with zero:
t = .5; % threshold
A(mean(A==0,2) > t, :) = [];
What this does:
Compare A with 0: turns zeros into true, and nonzeros into false.
Compute the mean of each row.
Compare that to the desired threshold.
Use the result as a logical index to delete unwanted rows.
Equivalently, you can keep the wanted rows instead of removing the unwanted ones. This may be faster depending on the proportion of rows:
A = A(mean(A~=0,2) >= 1-t, :);
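Applied to the matrix from the question, the steps can be sketched as follows. The variable names zeroFrac, toRemove and Akept are only there for illustration.

```matlab
% Example matrix from the question
A = [1 0 0 0 5 0;
     2 3 5 4 3 1;
     3 0 0 4 3 0;
     2 0 9 8 2 1;
     0 0 4 0 1 0];
t = 0.5;                      % threshold

zeroFrac = mean(A == 0, 2);   % fraction of zeros in each row
toRemove = zeroFrac > t;      % true for rows with more than 50% zeros
Akept = A(~toRemove, :);      % keep the remaining rows

disp(Akept)
```

Row 3 has exactly 50% zeros, so it is kept because the comparison is strict. Use >= instead of > if such rows should be removed as well.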
You can also use the standardizeMissing function and rmmissing function together to achieve this:
>> [~,tf] = rmmissing(standardizeMissing(A,0),'MinNumMissing',floor(0.5*size(A,2))+1);
>> A(~tf,:)
The call to standardizeMissing replaces the 0 values with NaN (the standard missing indicator for double), then the rmmissing call identifies in the logical vector tf the rows that have more than 50% of their entries as 0 (i.e., those rows that have at least floor(0.5*size(A,2))+1 zero-valued entries). You can then negate tf and use it as a row index. You can easily adapt the minimum number of missing entries to satisfy whatever percentage criterion you want.
Also note that tf is a logical vector with one element per row of A.
As I mentioned on Luis' answer, one downside to his approach is that it requires an intermediate logical array of the same size as A to be created, which can potentially incur a significant memory/performance penalty when working with large arrays.
An explicit looped approach with nnz (overly verbose, for clarity):
[nrows, ncols] = size(A);
maximum_ratio_of_zeros = 0.5;
minimum_ratio_of_nonzeros = 1 - maximum_ratio_of_zeros;
todelete = false(nrows, 1);
for ii = 1:nrows
    if nnz(A(ii,:))/ncols < minimum_ratio_of_nonzeros
        todelete(ii) = true;
    end
end
A(todelete,:) = [];
Which returns the desired answer.

Indexing into matrix with logical array

I have a matrix A, which is m x n. What I want to do is count the number of NaN elements in a row. If the number of NaN elements is greater than or equal to some arbitrary threshold, then all the values in that row will be set to NaN.
num_obs = sum(isnan(rets), 2);
index = num_obs >= min_obs;
Like I say, I am struggling to get my brain to work. I've been trying different variations of the line below, but no luck.
rets(index==0, :) = rets(index==0, :) .* NaN;
The Example data for threshold >= 1 is:
A = [-7 -8 1.6 11.9;
NaN NaN NaN NaN;
5.5 6.3 2.1 NaN;
5.5 4.2 2.2 5.6;
NaN NaN NaN NaN];
and the result I want is:
A = [-7 -8 1.6 11.9;
NaN NaN NaN NaN;
NaN NaN NaN NaN;
5.5 4.2 2.2 5.6;
NaN NaN NaN NaN];
Use
A = magic(4);A(3,3)=nan;
threshold=1;
for ii = 1:size(A,1) % loop over rows
    if sum(isnan(A(ii,:)))>=threshold % get the nans, sum the occurrences
        A(ii,:)=nan(1,size(A,2)); % fill the row with column width amount of nans
    end
end
Results in
A =
16 2 3 13
5 11 10 8
NaN NaN NaN NaN
4 14 15 1
Or, as @Obchardon mentioned in his comment, you can vectorise:
A(sum(isnan(A),2)>=threshold,:) = NaN
A =
16 2 3 13
5 11 10 8
NaN NaN NaN NaN
4 14 15 1
As a side-note you can easily change this to columns, simply do all indexing for the other dimension:
A(:,sum(isnan(A),1)>=threshold) = NaN;
Instead of the isnan function, you can use A ~= A to locate NaN elements, since NaN is the only value that is not equal to itself.
A(sum((A ~= A),2) >= t,:) = NaN
where t is the threshold for the minimum number of existing NaN elements.
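For reference, here is a minimal sketch on the example data from the question, using the question's name min_obs for the threshold. The mask name rowMask is made up for illustration. Note that the attempt in the question indexes with index==0, which selects the rows below the threshold, so the mask has to be used directly instead.

```matlab
% Example data from the question
A = [-7  -8  1.6 11.9;
     NaN NaN NaN NaN;
     5.5 6.3 2.1 NaN;
     5.5 4.2 2.2 5.6;
     NaN NaN NaN NaN];
min_obs = 1;                             % threshold on the NaN count per row

rowMask = sum(isnan(A), 2) >= min_obs;   % rows with at least min_obs NaNs
A(rowMask, :) = NaN;                     % scalar NaN expands over the selected rows

disp(A)
```

Rows 1 and 4 contain no NaN and are left alone; row 3 had a single NaN and is now filled with NaN entirely.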

Elementwise comparison of two vectors while ignoring all NaN's in between

I have two vectors 1x5000. They consist of numbers like this:
vec1 = [NaN NaN 2 NaN NaN NaN 5 NaN 8 NaN NaN 7 NaN 5 NaN 3 NaN 4]
vec2 = [NaN 2 NaN NaN 5 NaN NaN NaN 8 NaN 1 NaN NaN NaN 5 NaN NaN NaN]
I would like to check if the order of the numbers are equal, independent of the NaNs. But I do not want to remove the NaNs (Not-a-Number) since I will use them later. So now I create a new vector and call it results. Once they come in the same order, it is correct and we fill results with 1. If the next numbers are not equal we add 0 to results.
An example results would look like this for vec1 and vec2:
[1 1 1 0 1 0 0]
The first 3 numbers are the same, then 7 is compared to 1 which gives 0, then 5 compared to 5 is true which gives 1. Then the last two numbers are missing which gives 0.
One reason that I don't want to remove the NaNs is that I have a time vector 1x500 and somehow I want to get the time for each 1 and 0 (in a new vector). Is that possible too?
Help is super appreciated!
This is how I would do it:
temp1 = vec1(~isnan(vec1));
temp2 = vec2(~isnan(vec2));
m = min(numel(temp1), numel(temp2));
M = max(numel(temp1), numel(temp2));
results = [(temp1(1:m) == temp2(1:m)), false(1,M-m)];
Note that results here is a logical array. If you need it as a numeric array, you can convert it, for instance with double.
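Run on the two example vectors from the question, this can be sketched as below; the only addition is the conversion to double at the end.

```matlab
vec1 = [NaN NaN 2 NaN NaN NaN 5 NaN 8 NaN NaN 7 NaN 5 NaN 3 NaN 4];
vec2 = [NaN 2 NaN NaN 5 NaN NaN NaN 8 NaN 1 NaN NaN NaN 5 NaN NaN NaN];

temp1 = vec1(~isnan(vec1));            % 2 5 8 7 5 3 4
temp2 = vec2(~isnan(vec2));            % 2 5 8 1 5
m = min(numel(temp1), numel(temp2));   % length of the overlapping part
M = max(numel(temp1), numel(temp2));   % length of the longer sequence

% Compare the overlap, then pad with false for the unmatched tail
results = double([(temp1(1:m) == temp2(1:m)), false(1, M-m)]);

disp(results)  % 1 1 1 0 1 0 0
```

This matches the expected result given in the question: the fourth pair (7 against 1) differs, and the last two numbers of vec1 have no counterpart in vec2.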
Regarding your concern about NaNs, it depends on what you want to do with your arrays. If you are going to process them, it is more convenient to remove the NaNs. To keep track of things, you can store the indices of the kept elements:
id1 = find(~isnan(vec1));
vec1 = vec1(id1);
vec1 =
2 5 8 7 5 3 4
id1 =
3 7 9 12 14 16 18
% and same for vec2
If you decide to remove the NaNs, the solution will be the same, with all temps replaced with vec.
This would be my solution, using a mix of logical indexing and the find function. Returning the timestamps for the 1's and 0's is actually more tedious than finding the 1's and 0's.
vec1 = [NaN NaN 2 NaN NaN NaN 5 NaN 8 NaN NaN 7 NaN 5 NaN 3 NaN 4];
vec2 = [NaN 2 NaN NaN 5 NaN NaN NaN 8 NaN 1 NaN NaN NaN 5 NaN NaN NaN];
t=1:numel(vec1);
ind1=find(~isnan(vec1));
ind2=find(~isnan(vec2));
v1=vec1(ind1);
v2=vec2(ind2);
if length(v1)>length(v2)
    ibig=1;
else
    ibig=2;
end
n=min(length(v1),length(v2));
N=max(length(v1),length(v2));
v=false(1,N);
v(1:n)=v1(1:n)==v2(1:n);
t_ones1=t(ind1(v));
t_ones2=t(ind2(v));
if ibig==1
    t_zeros1=t(ind1(~v));
    t_zeros2=t(ind2(~v(1:n)));
else
    t_zeros1=t(ind1(~v(1:n)));
    t_zeros2=t(ind2(~v));
end

Find the value that corresponds to an index

I have an array 2549x13 double (M).
Example lines:
-7.8095 -4.4135 -0.0881 2.5159 6.3142 6.9519 4.9788 2.9109 0.6623 -0.9269 0.3172 1.2445 -0.0730
4.5819 6.2371 5.8721 6.1824 5.2074 4.8656 5.0269 5.3340 3.6919 1.3608 -0.5443 0.2871 -1.2070
-6.2273 -3.7767 1.1829 2.8522 3.2428 0.5261 -3.5535 -7.7743 -8.4391 -9.8188 -6.0503 -5.8805 -7.7700
-2.2157 -3.2100 -4.4400 -3.5898 -0.8901 3.4061 6.5631 7.2028 4.3082 -0.7742 -5.0963 -3.1837 0.4372
5.5682 5.5393 3.4691 0.6789 1.7320 4.4472 3.7622 1.0194 -0.5362 -3.1721 -6.1281 -6.3959 -6.1932
0.9707 -0.2701 -3.8883 -8.8974 -7.0375 -1.5085 5.4171 6.0831 2.9852 -2.3474 -4.5637 -3.7378 1.3236
-2.811 0.0164 2.7208 5.7862 7.3344 8.3504 9.0635 8.4271 2.7669 -2.1403 -2.2003 -0.9940 0.7729
4.2382 3.3532 3.5475 7.9209 11.7933 14.3181 13.6289 12.9553 13.7464 14.1331 14.3814 16.7949 15.9003
-0.0539 -2.7059 -3.8141 -2.7531 -1.7465 0.9190 2.2220 0.7268 1.5436 1.0426 2.3535 3.0269 6.4798
I also have the indices of some values I need, 2549x5 double(inde).
Example lines:
4 5 6 7 8
0 1 2 3 4
3 4 5 6 7
6 7 8 9 10
-1 0 1 2 3
6 7 8 9 10
5 6 7 8 9
10 11 12 13 14
11 12 13 14 15
I want now to create a new array/matrix with the actual values. So, to find in the array M the values corresponding to the indices inde.
However, if the index (in inde) is equal to zero, I would like to take the values corresponding to the indices 1,2,3,4 of that row.
If the index is -1 or 15, I would like to insert an NaN in the new array/matrix.
If the index is 14, I would like to take the values corresponding to 10,11,12,13.
So I would like to obtain:
2.5159 6.3142 6.9519 4.9788 2.9109
NaN 4.5819 6.2371 5.8721 6.1824
1.1829 2.8522 3.2428 0.5261 -3.5535
3.4061 6.5631 7.2028 4.3082 -0.7742
NaN
-1.5085 5.4171 6.0831 2.9852 -2.3474
7.3344 8.3504 9.0635 8.4271 2.7669
14.1331 14.3814 16.7949 15.9003 NaN
NaN
Very grateful to anyone who could help with this.
This will give you the desired array:
rows = size(M, 1);                  % number of rows in M and inde
cols = size(inde, 2);               % number of columns in inde
N = [nan(rows, 2) M nan(rows, 2)];  % pad M with 2 columns of NaN values on left and right
inde = inde + 2;                    % change indices to account for padding
P = zeros(rows, cols);              % preallocate result matrix
nanrow = nan(1, cols);              % make a row of all NaN values
for row_num = 1:rows
    P(row_num,:) = N(row_num, inde(row_num,:));  % get values from N
    if sum(isnan(P(row_num,:))) > 1              % if 2 NaN values, original index was -1 or 15
        P(row_num,:) = nanrow;                   % so make it all NaN's
    end
end
(I dislike leaving that stray 2 in there when padding, but I was unsure what the expected result was for different numbers of columns of inde, if that's even a concern. Perhaps floor(cols/2)?)
Since MATLAB won't allow you to have matrices with rows of unequal length, for rows in which there are indices of -1 or 15, I've inserted a row of all NaN values. This can obviously be changed to whatever you prefer by modifying the line inside the if clause.
Results using M and inde from your example:
P =
2.51590 6.31420 6.95190 4.97880 2.91090
NaN 4.58190 6.23710 5.87210 6.18240
1.18290 2.85220 3.24280 0.52610 -3.55350
3.40610 6.56310 7.20280 4.30820 -0.77420
NaN NaN NaN NaN NaN
-1.50850 5.41710 6.08310 2.98520 -2.34740
7.33440 8.35040 9.06350 8.42710 2.76690
14.13310 14.38140 16.79490 15.90030 NaN
NaN NaN NaN NaN NaN
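If the loop turns out to be a bottleneck, the same padding idea can probably be vectorised with sub2ind. This is a sketch of a swapped-in technique, not something taken from the answer above. The toy M and inde are made up so that the result is easy to check by eye, and rowIdx is a hypothetical helper name.

```matlab
% Hypothetical toy data: 3 rows, 13 columns, row r holds (r-1)*13 + (1:13)
M = reshape(1:39, 13, 3)';
inde = [ 4 5 6 7 8;
         0 1 2 3 4;
        -1 0 1 2 3];

[rows, cols] = size(inde);
N = [nan(rows, 2), M, nan(rows, 2)];        % pad with 2 NaN columns on each side
rowIdx = repmat((1:rows)', 1, cols);        % row subscript for every entry of inde
P = N(sub2ind(size(N), rowIdx, inde + 2));  % shift column indices by the padding
P(sum(isnan(P), 2) > 1, :) = NaN;           % two NaNs mean index -1 or 15 was hit

disp(P)
```

As in the loop version, a row whose indices reach -1 or 15 ends up as all NaN, while a row that only touches index 0 or 14 keeps its four valid values.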
EDIT
I suggest not mixing numbers and characters in your matrix, since it would have to become a cell array, which is harder to handle.
So I assume for the rest of my answer that you want to put zeros (or any error value; -999, for instance, is sometimes used) where you want to modify your data. Assuming A is your data matrix and i is your matrix of indices:
B=zeros(size(i));
for j=1:size(i,1)
    if (prod(i(j,:))==0)
        k=find(i(j,:)==0);
        B(j,k+1:end)= A(j,i(j,k+1:end));
        m=find(i(j,:)<0);
        if (~isempty(m))
            B(j,:)= 0;
        end
    else
        B(j,:)= A(j,i(j,:));
    end
end
I get:
2.5159 6.3142 6.9519 4.9788 2.9109
0 4.5819 6.2372 5.8722 6.1824
1.1830 2.8522 3.2429 0.5261 -3.5535
3.4061 6.5632 7.2028 4.3083 -0.7742
0 0 0 0 0
-1.5086 5.4171 6.0831 2.9853 -2.3475
7.3344 8.3505 9.0635 8.4271 2.7670

Accumulating different sized column vectors stored as a cell array into a matrix padded with NaNs

Imagine I have a series of different sized column vectors inside an array and want to group them into a matrix by padding the empty spaces with NaN. How can I do this?
There is already an answer to a very similar problem (accumulate cells of different lengths into a matrix in MATLAB?) but that solution deals with row vectors and my problem is with column vectors. One possible solution could be transposing each of the array components and then applying the above mentioned solution. However, I have no idea how to do this.
Also, speed is a bit of an issue so if possible take that into consideration.
You can just slightly tweak that answer you found to work for columns:
tcell = {[1,2,3]', [1,2,3,4,5]', [1,2,3,4,5,6]', [1]', []'};
maxSize = max(cellfun(@numel,tcell));
fcn = @(x) [x; nan(maxSize-numel(x),1)];
cmat = cellfun(fcn,tcell,'UniformOutput',false);
cmat = horzcat(cmat{:})
cmat =
1 1 1 1 NaN
2 2 2 NaN NaN
3 3 3 NaN NaN
NaN 4 4 NaN NaN
NaN 5 5 NaN NaN
NaN NaN 6 NaN NaN
Or you could tweak this as an alternative:
cell2mat(cellfun(@(x)cat(1,x,NaN(maxSize-length(x),1)),tcell,'UniformOutput',false))
If you want speed, the cell data structure is your enemy. For this example I will assume you have these vectors stored in a structure called vector_holder:
elements = fieldnames(vector_holder);
% Per Dan's request
maximum_size = max(structfun(@numel, vector_holder));
% maximum_size is the maximum length of all your separate arrays
matrix = NaN(length(elements), maximum_size);
for i = 1:length(elements)
    % dynamic field access; each vector is stored as one row of matrix
    current_length = length(vector_holder.(elements{i}));
    matrix(i, 1:current_length) = vector_holder.(elements{i});
end
Many Matlab functions are slower when dealing with cell variables. In addition, a cell matrix with N double-precision elements requires more memory than a double-precision matrix with N elements.
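If the data already lives in a cell array, one option that avoids the anonymous function and the intermediate padded cells is to preallocate the NaN matrix and fill it column by column. This is only a sketch; whether it is actually faster would need timing on your own data. The empty entry is written as zeros(0,1) here, and lens is a made-up helper name.

```matlab
% Column vectors of different lengths (same values as the example above)
tcell = {[1;2;3], [1;2;3;4;5], [1;2;3;4;5;6], 1, zeros(0,1)};

lens = cellfun(@numel, tcell);           % length of each vector
cmat = NaN(max(lens), numel(tcell));     % preallocate, already padded with NaN
for k = 1:numel(tcell)
    cmat(1:lens(k), k) = tcell{k};       % copy vector k into column k
end

disp(cmat)
```

Each vector lands in its own column, so the result has the same layout as the cmat shown earlier.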

Resources