Method .pivot_table() returns innumerable NaN unexpectedly

I am pivoting several data frames. Some are pivoted correctly; others are not.
I have 5 data frames with the same structure (data acquired from a PLC: timestamp, variable name and corresponding value). Their shapes before pivoting are:
((85247, 5), (255737, 5), (255734, 5), (574065, 5), (567587, 5))
Each pivoted frame has the timestamp as its index and one column of values per variable name (the ID and quality columns are dropped before pivoting).
Out of the 5 data frames, 3 are pivoted correctly and 2 are filled with far too many NaN values. The shapes after pivoting are:
(85247, 1), (85258, 3), (85258, 3), (85542, 84), (85216, 13)
The code is this one:
df_WATER['TIMESTAMP'] = pd.to_datetime(df_WATER['TIMESTAMP'], errors='ignore')
df_HSD['TIMESTAMP'] = pd.to_datetime(df_HSD['TIMESTAMP'], errors='ignore')
df_HSCPLC3['TIMESTAMP'] = pd.to_datetime(df_HSCPLC3['TIMESTAMP'], errors='ignore')
df_HSCPLC2ACT['TIMESTAMP'] = pd.to_datetime(df_HSCPLC2ACT['TIMESTAMP'], errors='ignore')
df_FURNACE['TIMESTAMP'] = pd.to_datetime(df_FURNACE['TIMESTAMP'], errors='ignore')
df_WATER = df_WATER.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_HSD = df_HSD.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_HSCPLC3 = df_HSCPLC3.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_HSCPLC2ACT = df_HSCPLC2ACT.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_FURNACE = df_FURNACE.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
It happens with the data frames that have more than 10 columns after pivoting ((85542, 84) and (85216, 13)). I wonder whether this is a limitation of the function. A sample of one affected column:
TIMESTAMP
2022-11-18 20:00:00.224 NaN
2022-11-18 20:00:00.731 NaN
2022-11-18 20:00:01.240 NaN
2022-11-18 20:00:01.751 NaN
2022-11-18 20:00:02.259 NaN
... ...
2022-11-19 07:59:57.906 NaN
2022-11-19 07:59:58.411 NaN
2022-11-19 07:59:58.920 NaN
2022-11-19 07:59:59.420 NaN
2022-11-19 07:59:59.927 NaN
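One possible explanation, offered as an assumption rather than a diagnosis of the data above: pivot_table has no 10-column limit, but it leaves NaN in every cell for which no row with that TIMESTAMP and NAME exists. If the variables are logged at slightly different millisecond timestamps, most cells stay empty. The sketch below uses made-up data (the names TEMP and FLOW and the 500 ms grid are hypothetical) to reproduce the effect and to show one way of aligning the timestamps before pivoting.

```python
import pandas as pd

# Hypothetical long-format data: two variables logged a few ms apart.
long_df = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime([
        "2022-11-18 20:00:00.224", "2022-11-18 20:00:00.226",
        "2022-11-18 20:00:00.731", "2022-11-18 20:00:00.734",
    ]),
    "NAME": ["TEMP", "FLOW", "TEMP", "FLOW"],
    "VALUE": [1.0, 10.0, 2.0, 20.0],
})

# Every distinct timestamp becomes its own row, so each row holds a value
# for only one variable and NaN for the other.
wide = long_df.pivot_table(index="TIMESTAMP", columns="NAME", values="VALUE")
nan_share = wide.isna().mean()  # fraction of NaN cells per column

# Assumed workaround: floor the timestamps to a common grid first.
aligned = long_df.assign(TIMESTAMP=long_df["TIMESTAMP"].dt.floor("500ms"))
wide_aligned = aligned.pivot_table(index="TIMESTAMP", columns="NAME", values="VALUE")

print(wide)
print(nan_share)
print(wide_aligned)
```

Flooring can put several samples of the same variable into one bin; pivot_table then averages them, because its default aggfunc is the mean. Checking wide.isna().mean() on the real frames would show whether misaligned timestamps are actually the cause.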

Related

MATLAB: How can I add NaN values to an array where data does not exist based on row/column indices from a different array?

I have two dataset arrays, A and B. They are two different, independent measurements (e.g. smell and color of some object).
For each data entry in A and B, I have a time, t, and a location, p of the measurement. The majority of the smell and color measurements were taken at the same time and location. However, there are some times where data is missing (i.e. at some time there was no color measurement and only a smell measurement). Similarly, there are some locations where some data is missing (i.e. at some location there was only color measurement and no smell measurement).
I want to build arrays of A and B which have the same size where each row corresponds to a full set of all times and each column corresponds to a full set of all locations. If there is data missing, I want that entry to be NaN.
Below is an example of what I want to do:
%Inputs
A = [0 0 1 2 4; 1 1 3 3 2; 4 4 1 0 3];
t_A = [0.03 1.6 3.9]; %Times when A was measured (rows of A)
L_A = [1.0 2.9 2.98 4.2 6.33]; %Locations where A was measured (columns of A)
B = [10 13 10 10; 15 13 13 12; 14 14 13 12; 15 19 11 13];
t_B = [0.03 1.6 1.9 3.9]; %Times when B was measured (rows of B)
L_B = [2.1 2.9 2.98 5.0]; %Locations where B was measured (columns of B)
What I want is some code to transform these datasets into the following:
t = [0.03 1.6 1.9 3.9];
L = [1.0 2.1 2.9 2.98 4.2 5.0 6.33];
A_new = [0 NaN 0 1 2 NaN 4; 1 NaN 1 3 3 NaN 2; NaN NaN NaN NaN NaN NaN NaN; 4 NaN 4 1 0 NaN 3];
B_new = [NaN 10 13 10 NaN 10 NaN; NaN 15 13 13 NaN 12 NaN; NaN 14 14 13 NaN 12 NaN; NaN 15 19 11 NaN 13 NaN];
The new arrays, A_new and B_new, are the same size, and the vectors t and L (corresponding to the rows and columns) are sorted. The original A had no data at t = 1.9, so the 3rd row of A_new is all NaN. The same applies to columns 2 and 6 of A_new and columns 1, 5 and 7 of B_new.
How can I do this in MATLAB quickly for a large dataset?
Create a matrix of NaNs, use the third output of the unique function to convert the floating-point values to integer indices, and use matrix indexing to fill the matrices:
[t,~,it] = unique([t_A t_B]);
[L,~,iL] = unique([L_A L_B]);
A_new = NaN(numel(t),numel(L));
A_new(it(1:numel(t_A)),iL(1:numel(L_A))) = A;
B_new = NaN(numel(t),numel(L));
B_new(it(numel(t_A)+1:end),iL(numel(L_A)+1:end)) = B;
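For readers working in Python, as in the pivot_table question at the top, the same idea can be sketched with NumPy. This is a rough counterpart under stated assumptions, not part of the original answer: np.unique with return_inverse=True plays the role of the third output of unique, and np.ix_ builds the row-by-column index grid.

```python
import numpy as np

# Inputs copied from the question.
A = np.array([[0, 0, 1, 2, 4], [1, 1, 3, 3, 2], [4, 4, 1, 0, 3]], dtype=float)
t_A = np.array([0.03, 1.6, 3.9])
L_A = np.array([1.0, 2.9, 2.98, 4.2, 6.33])
B = np.array([[10, 13, 10, 10], [15, 13, 13, 12],
              [14, 14, 13, 12], [15, 19, 11, 13]], dtype=float)
t_B = np.array([0.03, 1.6, 1.9, 3.9])
L_B = np.array([2.1, 2.9, 2.98, 5.0])

# Sorted unique times/locations plus, for every original entry,
# its position in the sorted unique vector.
t, it = np.unique(np.concatenate([t_A, t_B]), return_inverse=True)
L, iL = np.unique(np.concatenate([L_A, L_B]), return_inverse=True)

# Start from all-NaN matrices and fill in the measured blocks.
A_new = np.full((t.size, L.size), np.nan)
A_new[np.ix_(it[:t_A.size], iL[:L_A.size])] = A

B_new = np.full((t.size, L.size), np.nan)
B_new[np.ix_(it[t_A.size:], iL[L_A.size:])] = B

print(A_new)
print(B_new)
```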

Indexing into matrix with logical array

I have a matrix A, which is m x n. I want to count the number of NaN elements in each row. If that count is greater than or equal to some arbitrary threshold, all the values in that row should be set to NaN.
num_obs = sum(isnan(rets), 2);
index = num_obs >= min_obs;
I am struggling to get my brain to work. I have been trying different variations of the line below, but with no luck.
rets(index==0, :) = rets(index==0, :) .* NaN;
The example data for a threshold of 1 is:
A = [-7 -8 1.6 11.9;
NaN NaN NaN NaN;
5.5 6.3 2.1 NaN;
5.5 4.2 2.2 5.6;
NaN NaN NaN NaN];
and the result I want is:
A = [-7 -8 1.6 11.9;
NaN NaN NaN NaN;
NaN NaN NaN NaN;
5.5 4.2 2.2 5.6;
NaN NaN NaN NaN];
Use
A = magic(4);A(3,3)=nan;
threshold=1;
for ii = 1:size(A,1) % loop over rows
    if sum(isnan(A(ii,:)))>=threshold % find the NaNs and count the occurrences
        A(ii,:)=nan(1,size(A,2)); % fill the row with NaNs, one per column
    end
end
Results in
A =
16 2 3 13
5 11 10 8
NaN NaN NaN NaN
4 14 15 1
Or, as @Obchardon mentioned in his comment, you can vectorise:
A(sum(isnan(A),2)>=threshold,:) = NaN
A =
16 2 3 13
5 11 10 8
NaN NaN NaN NaN
4 14 15 1
As a side-note you can easily change this to columns, simply do all indexing for the other dimension:
A(:,sum(isnan(A),1)>=threshold) = NaN;
Instead of the isnan function, you can use A ~= A to find the NaN elements, since NaN is the only value that is not equal to itself.
A(sum((A ~= A),2) >= t,:) = NaN
where t is the threshold for the minimum number of existing NaN elements.
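For readers working in Python, as in the pivot_table question at the top, a rough NumPy counterpart of the vectorised answer might look like this. It is a sketch using the example matrix from the question with a threshold of 1; the name rows_to_blank is illustrative.

```python
import numpy as np

nan = np.nan
A = np.array([
    [-7.0, -8.0, 1.6, 11.9],
    [nan, nan, nan, nan],
    [5.5, 6.3, 2.1, nan],
    [5.5, 4.2, 2.2, 5.6],
    [nan, nan, nan, nan],
])
threshold = 1

# Count NaNs per row, then blank every row that reaches the threshold.
rows_to_blank = np.isnan(A).sum(axis=1) >= threshold
A[rows_to_blank, :] = np.nan

print(A)
```

Using axis=0 and indexing the second dimension instead applies the same rule to columns.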

complete rows across multiple arrays in matlab

I have two arrays and I need to count the number of rows that do not contain a NaN in any column of either array. I want the sample size that results from using an array of inputs to train against a vector of targets (rows containing NaN are not used). Here is an example of my current solution:
% A matrix
A = [
-0.0057 14.8750 293.2000 2.3743 0 NaN -0.1186 NaN 38.1000
2.1543 10.2240 294.0200 1.7650 0 NaN 0.0962 NaN 30.4800
2.6071 7.1014 266.4000 1.3941 0 NaN -0.1110 23.6660 27.9400
0.9736 10.5730 271.2000 1.8700 0 NaN -0.2457 31.7290 27.9400
-0.7138 13.6430 286.3100 2.0655 0 NaN -0.5152 44.3640 27.9400
4.4969 5.5410 280.1600 0.6042 0 NaN -0.2783 47.9240 27.9400
5.4186 2.5648 251.6900 0.2323 0 NaN -0.0879 39.6710 25.4000
4.3641 3.4062 266.7800 0.5696 0 NaN -0.0638 26.9330 25.4000
-0.3348 8.2900 258.8900 1.3736 0 NaN -0.0414 59.2570 25.4000
0.3007 8.3617 274.7400 1.3929 0 NaN -0.3473 46.6710 25.4000
3.0400 4.6077 267.3400 0.9704 0 0.5178 -0.2080 32.4850 25.4000
2.1950 7.7303 253.8300 1.3545 0 0.4927 -0.0870 31.4520 25.4000
-0.4413 4.2283 275.7400 0.4724 0 0.3687 -0.2470 40.3630 27.9400
-0.8667 4.0397 261.0800 0.6118 0 0.4143 -0.4723 28.7360 27.9400
-8.0407 2.2782 158.9600 0.4654 0 0.1775 -0.9863 56.7880 30.4800
-15.4630 2.0072 230.4100 0.2572 0 0.0530 -2.2110 71.3660 35.5600
-14.7670 6.6983 293.4800 0.9218 0 0.1224 -4.3823 42.2330 38.1000
-8.5713 4.2573 249.6900 0.5928 0 0.2057 -4.6927 37.2790 38.1000
-13.4820 1.4811 120.2200 0.2327 0 0.0542 -4.1213 76.5140 38.1000
-15.6230 3.9040 300.8400 0.2369 0 0.0602 -3.4780 71.9860 NaN]
% And a vector of targets
B = [
NaN
NaN
1.2009
0.6404
0.5739
0.6846
0.4121
0.7475
0.5931
0.5706
0.8581
0.9910
NaN
0.5652
0.4008
NaN
0.4585
0.5463
0.2903
0.3150]
% Inputs
Alogic = isnan(A); % logical matrix of nans for drivers used
AlogicNaNSum = sum(Alogic,2); % sum by row
ANaNSumlogic = AlogicNaNSum >0; % logical by row with 0 for complete, 1 for some row containing NaNs
% Target
Blogic = isnan(B); % logical version of target with 0 for complete, 1 for containing NaN
SumNaNrows = Blogic + ANaNSumlogic; % add logical vectors, with 0 meaning no NaNs in any column
% Final number of rows with no NaN in any column
complete = sum(SumNaNrows(:)==0)
It seems like there should be a more elegant way to do this (fewer lines of code) that could still apply to vectors and/or matrices of the same length. There are many posts already about finding and replacing NaN rows like this and this, but I haven't found as much about counting the total number of complete rows across arrays.
You can do this using some basic logical operations. As you've shown, we can use isnan to create a logical matrix the size of the input that is true wherever there is a NaN. We can then use any with a second argument of 2 (the dimension to operate along) to check which rows contain any NaN values. Finally, the element-wise or (|) gives a logical vector that is true if a row in A has a NaN value or the corresponding element of B is NaN.
toremove = any(isnan(A), 2) | isnan(B);
Then if you simply want the number of rows that match:
complete = sum(~(any(isnan(A), 2) | isnan(B)));
You could also flip the logic around and check for rows that have no NaN values. The results will be the same:
tokeep = all(~isnan(A), 2) & ~isnan(B);
complete = sum(tokeep);
Yet another alternative would be to append B as a new column of A and check the resulting matrix for rows that don't contain any NaN values:
tokeep = ~any(isnan([A, B]), 2)
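For readers working in Python, as in the pivot_table question at the top, the same row-wise logic can be sketched with NumPy. The small arrays below are made up for illustration and stand in for the A and B from the question.

```python
import numpy as np

nan = np.nan
# Hypothetical stand-ins: A holds the inputs, B the target.
A = np.array([[1.0, nan], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
B = np.array([0.5, nan, 0.7, 0.8])

# A row is complete when no column of A is NaN and B is not NaN.
to_keep = ~np.isnan(A).any(axis=1) & ~np.isnan(B)
complete = int(to_keep.sum())

# Equivalent: append B as an extra column and test the combined matrix.
to_keep_alt = ~np.isnan(np.column_stack([A, B])).any(axis=1)

print(complete)
```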

Assign rows and column values to a NaN matrix in specific locations

I have a 155x135 matrix of NaNs and another matrix listing specific values with their row and column numbers. Is there a way to assign these values to the NaN matrix at those locations, leaving everything else as NaN?
R C Value
19 4 -1133.803
20 4 -295.6810
32 4 -1906.021
20 5 -1027.048
21 5 -293.0065
32 5 236.0525
33 5 -425.1248
Use sub2ind:
data = [
% R C Value
19 4 -1133.803
20 4 -295.6810
32 4 -1906.021
20 5 -1027.048
21 5 -293.0065
32 5 236.0525
33 5 -425.1248];
N = nan(155,135);
N(sub2ind(size(N),data(:,1),data(:,2))) = data(:,3);
So for N(min(data(:,1)):max(data(:,1)),min(data(:,2)):max(data(:,2))) (i.e. N(19:33,4:5)) you get:
ans =
-1133.8 NaN
-295.68 -1027
NaN -293.01
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
NaN NaN
-1906 236.05
NaN -425.12
You can use accumarray:
result = accumarray([R C] , Value,[155,135],[],NaN)
Note: R, C and Value are assumed to be column vectors.
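For readers working in Python, as in the pivot_table question at the top, the assignment can be sketched with NumPy integer-array indexing, which needs no sub2ind equivalent. One assumption to note: the R and C values from the question are 1-based, so the sketch subtracts 1 to get NumPy's 0-based indices.

```python
import numpy as np

# Columns: R, C, Value (1-based row and column numbers, as in the question).
data = np.array([
    [19, 4, -1133.803],
    [20, 4, -295.6810],
    [32, 4, -1906.021],
    [20, 5, -1027.048],
    [21, 5, -293.0065],
    [32, 5, 236.0525],
    [33, 5, -425.1248],
])

rows = data[:, 0].astype(int) - 1  # convert to 0-based
cols = data[:, 1].astype(int) - 1

N = np.full((155, 135), np.nan)
N[rows, cols] = data[:, 2]  # paired (row, col) assignment

print(N[18:33, 3:5])  # same window as N(19:33,4:5) in MATLAB
```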

Elementwise comparison of two vectors while ignoring all NaN's in between

I have two vectors 1x5000. They consist of numbers like this:
vec1 = [NaN NaN 2 NaN NaN NaN 5 NaN 8 NaN NaN 7 NaN 5 NaN 3 NaN 4]
vec2 = [NaN 2 NaN NaN 5 NaN NaN NaN 8 NaN 1 NaN NaN NaN 5 NaN NaN NaN]
I would like to check whether the numbers appear in the same order, ignoring the NaNs. But I do not want to remove the NaNs (Not-a-Number) since I will use them later. So I create a new vector called results. When the numbers at the same position in the order are equal, we add 1 to results. If they are not equal, we add 0 to results.
An example results would look like this for vec1 and vec2:
[1 1 1 0 1 0 0]
The first 3 numbers are the same, then 7 is compared to 1 which gives 0, then 5 compared to 5 is true which gives 1. Then the last two numbers are missing which gives 0.
One reason that I don't want to remove the NaNs is that I have a time vector 1x500 and somehow I want to get the time for each 1 and 0 (in a new vector). Is that possible too?
Help is super appreciated!
This is how I would do it:
temp1 = vec1(~isnan(vec1));
temp2 = vec2(~isnan(vec2));
m = min(numel(temp1), numel(temp2));
M = max(numel(temp1), numel(temp2));
results = [(temp1(1:m) == temp2(1:m)), false(1,M-m)];
Note that results here is a logical array. If you need it to be numeric, you can convert it, for instance with double.
Regarding your concern about the NaNs, it depends on what you want to do with your arrays. If you are going to process them, it is more convenient to remove the NaNs. To keep track of things, you can keep the indices of the retained elements:
id1 = find(~isnan(vec1));
vec1 = vec1(id1);
vec1 =
2 5 8 7 5 3 4
id1 =
3 7 9 12 14 16 18
% and same for vec2
If you decide to remove the NaNs, the solution will be the same, with all temps replaced with vec.
This would be my solution, using a mix of logical indexing and the find function. Returning the timestamps for the 1's and 0's is actually more tedious than finding the 1's and 0's.
vec1 = [NaN NaN 2 NaN NaN NaN 5 NaN 8 NaN NaN 7 NaN 5 NaN 3 NaN 4];
vec2 = [NaN 2 NaN NaN 5 NaN NaN NaN 8 NaN 1 NaN NaN NaN 5 NaN NaN NaN];
t=1:numel(vec1);
ind1=find(~isnan(vec1));
ind2=find(~isnan(vec2));
v1=vec1(ind1);
v2=vec2(ind2);
if length(v1)>length(v2)
    ibig=1;
else
    ibig=2;
end
n=min(length(v1),length(v2));
N=max(length(v1),length(v2));
v=false(1,N);
v(1:n)=v1(1:n)==v2(1:n);
t_ones1=t(ind1(v));
t_ones2=t(ind2(v));
if ibig==1
    t_zeros1=t(ind1(~v));
    t_zeros2=t(ind2(~v(1:n)));
else
    t_zeros1=t(ind1(~v(1:n)));
    t_zeros2=t(ind2(~v));
end
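For readers working in Python, as in the pivot_table question at the top, the comparison can be sketched with NumPy. This is a rough counterpart of the answers above, not a definitive implementation; the indices are 0-based, and the names idx1, idx2 and t_match1 are illustrative.

```python
import numpy as np

nan = np.nan
vec1 = np.array([nan, nan, 2, nan, nan, nan, 5, nan, 8,
                 nan, nan, 7, nan, 5, nan, 3, nan, 4])
vec2 = np.array([nan, 2, nan, nan, 5, nan, nan, nan, 8,
                 nan, 1, nan, nan, nan, 5, nan, nan, nan])
t = np.arange(vec1.size)  # stand-in time vector

# Positions of the non-NaN entries; the originals stay untouched.
idx1 = np.flatnonzero(~np.isnan(vec1))
idx2 = np.flatnonzero(~np.isnan(vec2))

m = min(idx1.size, idx2.size)
M = max(idx1.size, idx2.size)

# Compare the overlapping part; unmatched trailing numbers count as False.
results = np.zeros(M, dtype=bool)
results[:m] = vec1[idx1[:m]] == vec2[idx2[:m]]

# Times (in vec1) at which the matching numbers occur.
t_match1 = t[idx1[:m][results[:m]]]

print(results.astype(int))
```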
