Populate missing timestamp data rows with NAN - MATLAB - arrays

I have one dataset in which some timestamps are missing. The code I have written so far is below:
x = table2dataset(Testing_data);
T1 = x(:,1);
C1 =dataset2cell(T1);
formatIn = 'yyyy-mm-dd HH:MM:SS';
t1= datenum(C1,formatIn);
% Creating 10 minutes of time interval;
avg = 10/60/24;
tnew = [t1(1):avg:t1(end)]';
indx = round((t1-t1(1))/avg) + 1;
ynew = NaN(length(tnew),1);
ynew(indx)=t1;
% replacing missing time with NaN
t = datetime(ynew,'ConvertFrom','datenum');
formatIn = 'yyyy-mm-dd HH:MM:SS';
DateVector = datevec(ynew,formatIn);
dt = datestr(ynew,'yyyy-mm-dd HH:MM:SS');
ds = string(dt);
The testing data has three parameters shown here,
Time x y
2009-04-10 02:00:00.000 1 0.1
2009-04-10 02:10:00.000 2 0.2
2009-04-10 02:30:00.000 3 0.3
2009-04-10 02:50:00.000 4 0.4
As you can see, for intervals of 10 minutes some timestamps are missing (2:20 and 2:40), so I want to add those timestamps, with the corresponding x and y values set to NaN. My output would then look like this:
Time x y
2009-04-10 02:00:00.000 1 0.1
2009-04-10 02:10:00.000 2 0.2
2009-04-10 02:20:00.000 NaN NaN
2009-04-10 02:30:00.000 3 0.3
2009-04-10 02:40:00.000 NaN NaN
2009-04-10 02:50:00.000 4 0.4
As you can see from my code, I am only able to put NaN into the time column; now I would like to carry the corresponding x and y values along, as in the desired output above.
Please note that I have more than 3000 data rows in the above format, and I want to do the same for all of them.

There seems to be a contradiction in your question: you say that you are able to insert NaN in place of the missing time string, but in the example of the expected output you wrote the time string.
Also, you refer to a missing timestamp (2:20), but if the time step is 10 minutes, your example data has another missing timestamp (2:40).
Assuming that:
you actually want to insert the missing time string
you want to handle all the missing timestamps
you could modify your code as follows:
the ynew time is not needed
the tnew time should be used in place of ynew
to insert the NaN values in the x and y columns you have to:
extract them from the dataset
create two new arrays, initializing them to NaN
insert the original x and y data at the locations identified by indx
Below you can find an updated version of your code.
the x and y data are stored in the x_data and y_data arrays
the new x and y data are stored in the x_data_new and y_data_new arrays
at the end of the script, two tables are generated: the first stores the time as a string (char array), the second as a cell array.
The comments in the code identify the modifications.
x = table2dataset(Testing_data);
T1 = x(:,1);
% Get X data from the table
x_data=x(:,2)
% Get Y data from the table
y_data=x(:,3)
C1 =dataset2cell(T1);
formatIn = 'yyyy-mm-dd HH:MM:SS';
t1= datenum(C1(2:end),formatIn)
avg = 10/60/24; % Creating 10 minutes of time interval;
tnew = [t1(1):avg:t1(end)]'
indx = round((t1-t1(1))/avg) + 1
%
% Not Needed
%
% ynew = NaN(length(tnew),1);
% ynew(indx)=t1;
%
% Create the new X and Y data: start from all-NaN columns on the
% regular time grid, then copy the original values to their positions
%
x_data_new=nan(length(tnew),1)
x_data_new(indx)=x_data
y_data_new=nan(length(tnew),1)
y_data_new(indx)=y_data
% t = datetime(ynew,'ConvertFrom','datenum') % replacing missing time with NAN
%
% Use tnew instead of ynew
%
t = datetime(tnew,'ConvertFrom','datenum') % replacing missing time with NAN
formatIn = 'yyyy-mm-dd HH:MM:SS'
% DateVector = datevec(y_data_new,formatIn)
% dt = datestr(ynew,'yyyy-mm-dd HH:MM:SS')
%
% Use tnew instead of ynew
%
dt = datestr(tnew,'yyyy-mm-dd HH:MM:SS')
% ds = char(dt)
new_table=table(dt,x_data_new,y_data_new)
new_table_1=table(cellstr(dt),x_data_new,y_data_new)
The output is
new_table =
dt x_data_new y_data_new
___________ __________ __________
[1x19 char] 1 0.1
[1x19 char] 2 0.2
[1x19 char] NaN NaN
[1x19 char] 3 0.3
[1x19 char] NaN NaN
[1x19 char] 4 0.4
new_table_1 =
Var1 x_data_new y_data_new
_____________________ __________ __________
'2009-04-10 02:00:00' 1 0.1
'2009-04-10 02:10:00' 2 0.2
'2009-04-10 02:20:00' NaN NaN
'2009-04-10 02:30:00' 3 0.3
'2009-04-10 02:40:00' NaN NaN
'2009-04-10 02:50:00' 4 0.4
Hope this helps.
Qapla'
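
The index-mapping idea above can be reduced to a small self-contained sketch. It assumes plain column vectors instead of the dataset (time_str, x_old and y_old are illustrative stand-ins for the columns of Testing_data, not names from the question), and it builds the regular grid from a rounded step count rather than the colon operator, so that floating-point error cannot drop the last sample.

```matlab
% Illustrative stand-ins for the three columns of Testing_data
time_str = {'2009-04-10 02:00:00'; '2009-04-10 02:10:00'; ...
            '2009-04-10 02:30:00'; '2009-04-10 02:50:00'};
x_old = [1; 2; 3; 4];
y_old = [0.1; 0.2; 0.3; 0.4];

step  = 10/60/24;                                  % 10 minutes, in days
t_old = datenum(time_str, 'yyyy-mm-dd HH:MM:SS');  % serial date numbers

% Regular grid: number of samples from the rounded span, then multiples of step
n_new = round((t_old(end) - t_old(1)) / step) + 1;
t_new = t_old(1) + (0:n_new-1)' * step;

% Position of every original sample on the regular grid
indx = round((t_old - t_old(1)) / step) + 1;

% All-NaN columns, then copy the original values into place
x_new = NaN(n_new, 1);  x_new(indx) = x_old;
y_new = NaN(n_new, 1);  y_new(indx) = y_old;

% Time strings for the full grid, including the previously missing rows
time_new = cellstr(datestr(t_new, 'yyyy-mm-dd HH:MM:SS'));
```

Rows of x_new and y_new that received no original sample stay NaN, which is exactly the padding asked for in the question.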

This example is not too different from the accepted answer, but IMHO a bit easier on the eyes. It also supports gaps larger than one step, and is a bit more generic because it makes fewer assumptions.
It works with plain cell arrays instead of the original table data, so that conversion is up to you (I'm on R2010a, so I can't test it).
% Example data with intentional gaps of varying size
old_data = {'2009-04-10 02:00:00.000' 1 0.1
'2009-04-10 02:10:00.000' 2 0.2
'2009-04-10 02:30:00.000' 3 0.3
'2009-04-10 02:50:00.000' 4 0.4
'2009-04-10 03:10:00.000' 5 0.5
'2009-04-10 03:20:00.000' 6 0.6
'2009-04-10 03:50:00.000' 7 0.7}
% Convert textual dates to numbers we can work with more easily
old_dates = datenum(old_data(:,1));
% Nominal step size is the minimum of all differences
deltas = diff(old_dates);
nominal_step = min(deltas);
% Generate new date numbers with constant step
new_dates = old_dates(1) : nominal_step : old_dates(end);
% Determine where the gaps in the data are, and how big they are,
% taking into account rounding error
step_gaps = abs(deltas - nominal_step) > 10*eps;
gap_sizes = round( deltas(step_gaps) / nominal_step - 1);
% Create new data structure with constant-step time stamps,
% initially with the data of interest all-NAN
new_size = size(old_data,1) + sum(gap_sizes);
new_data = [cellstr( datestr(new_dates, 'yyyy-mm-dd HH:MM:SS') ),...
repmat({NaN}, new_size, 2)];
% Compute proper locations of the old data in the new data structure,
% again, taking into account rounding error
day = 86400; % (seconds in a day)
new_datapoint = ismember(round(new_dates * day), ...
round(old_dates * day));
% Insert the old data at the right locations
new_data(new_datapoint, 2:3) = old_data(:, 2:3)
Output is:
old_data =
'2009-04-10 02:00:00.000' [1] [0.100000000000000]
'2009-04-10 02:10:00.000' [2] [0.200000000000000]
'2009-04-10 02:30:00.000' [3] [0.300000000000000]
'2009-04-10 02:50:00.000' [4] [0.400000000000000]
'2009-04-10 03:10:00.000' [5] [0.500000000000000]
'2009-04-10 03:20:00.000' [6] [0.600000000000000]
'2009-04-10 03:50:00.000' [7] [0.700000000000000]
new_data =
'2009-04-10 02:00:00' [ 1] [0.100000000000000]
'2009-04-10 02:10:00' [ 2] [0.200000000000000]
'2009-04-10 02:20:00' [NaN] [ NaN]
'2009-04-10 02:30:00' [ 3] [0.300000000000000]
'2009-04-10 02:40:00' [NaN] [ NaN]
'2009-04-10 02:50:00' [ 4] [0.400000000000000]
'2009-04-10 03:00:00' [NaN] [ NaN]
'2009-04-10 03:10:00' [ 5] [0.500000000000000]
'2009-04-10 03:20:00' [ 6] [0.600000000000000]
'2009-04-10 03:30:00' [NaN] [ NaN]
'2009-04-10 03:40:00' [NaN] [ NaN]
'2009-04-10 03:50:00' [ 7] [0.700000000000000]
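
On releases that have timetables, the same padding can be sketched with retime. This is an alternative to the cell-array approach above, not what either answer used; it assumes R2016b or later (so it will not run on R2010a) and reuses the question's four sample rows with illustrative variable names.

```matlab
% Assumes a MATLAB release with datetime, timetable and retime
Time = datetime({'2009-04-10 02:00:00'; '2009-04-10 02:10:00'; ...
                 '2009-04-10 02:30:00'; '2009-04-10 02:50:00'}, ...
                'InputFormat', 'yyyy-MM-dd HH:mm:ss');
x = [1; 2; 3; 4];
y = [0.1; 0.2; 0.3; 0.4];

TT = timetable(Time, x, y);

% Resample onto a regular 10-minute grid; rows with no source data
% are filled with the missing value for the type (NaN for doubles)
TT_regular = retime(TT, 'regular', 'fillwithmissing', ...
                    'TimeStep', minutes(10));
```

TT_regular then has one row per 10-minute step between the first and last timestamp, with NaN in x and y wherever the original data had a gap.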

Related

Selecting elements from a vector based on condition on another vector

I want to know how to select those numbers which correspond (i.e. same position) to my pre-defined numbers.
For example, I have these vectors:
a = [ 1 0.1 2 3 0.1 0.5 4 0.1];
b = [100 200 300 400 500 600 700 800]
I need to select elements from b which correspond to the positions of the whole numbers in a (1, 2, 3 and 4), so the output must be:
output = [1 100
2 300
3 400
4 700]
How can this be done?
Create a logical index based on a, and apply it to both a and b to get the desired result:
ind = ~mod(a,1); % true for integer numbers
output = [a(ind); b(ind)].'; % build result
round(x) == x ----> x is a whole number
round(x) ~= x ----> x is not a whole number
round(2.4) = 2 ------> round(2.4) ~= 2.4 --> 2.4 is not a whole number
round(2) = 2 --------> round(2) == 2 ----> 2 is a whole number
Following the same logic:
a = [ 1 0.1 2 3 0.1 0.5 4 0.1];
b = [100 200 300 400 500 600 700 800];
iswhole = (round(a) == a);
output = [a(iswhole); b(iswhole)]
Result:
output =
1 2 3 4
100 300 400 700
We can also generate the logical index from a using the fix() function:
ind = (a==fix(a));
output= [a(ind); b(ind)]'
Although the intention is not entirely clear, building a logical index into the vectors is the solution.
My solution is
checkint = @(x) ~isinf(x) & floor(x) == x % It's very fast in a big array
[a(checkint(a))' b(checkint(a))']
The key here is creating a logical index that marks the integer values in a, and applying it to both a and b. The function checkint does a good job of checking for integers.
Other approaches to checking for integers could be
checkint = @(x)double(uint64(x))==x % Slower but it works fine
or
checkint = @(x) mod(x,1) == 0 % Slowest, but it's robust and better for understanding what's going on
or
checkint = @(x) ~mod(x,1) % Slowest, relies on 0 being treated as false
It's been discussed in many other threads.
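
As a quick cross-check, the integer tests proposed in the answers above can be run side by side on the question's data. This is only a sketch; the ind_* names are illustrative.

```matlab
a = [ 1 0.1 2 3 0.1 0.5 4 0.1];
b = [100 200 300 400 500 600 700 800];

% Four equivalent ways (for finite input) to mark the whole numbers in a
ind_mod   = mod(a, 1) == 0;
ind_round = round(a) == a;
ind_fix   = fix(a) == a;
ind_floor = ~isinf(a) & floor(a) == a;

% Apply one of the masks to both vectors and stack the result as columns
output = [a(ind_mod); b(ind_mod)].';
```

For finite values all four masks agree. They differ only for non-finite input: round(Inf) == Inf is true, which is why the floor-based version adds the ~isinf guard.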

MATLAB: How can I add NaN values to an array where data does not exist based on row/column indices from a different array?

I have two dataset arrays, A and B. They are two different, independent measurements (e.g. smell and color of some object).
For each data entry in A and B, I have a time, t, and a location, p of the measurement. The majority of the smell and color measurements were taken at the same time and location. However, there are some times where data is missing (i.e. at some time there was no color measurement and only a smell measurement). Similarly, there are some locations where some data is missing (i.e. at some location there was only color measurement and no smell measurement).
I want to build arrays of A and B which have the same size where each row corresponds to a full set of all times and each column corresponds to a full set of all locations. If there is data missing, I want that entry to be NaN.
Below is an example of what I want to do:
%Inputs
A = [0 0 1 2 4; 1 1 3 3 2; 4 4 1 0 3];
t_A = [0.03 1.6 3.9]; %Times when A was measured (rows of A)
L_A = [1.0 2.9 2.98 4.2 6.33]; %Locations where A was measured (columns of A)
B = [10 13 10 10; 15 13 13 12; 14 14 13 12; 15 19 11 13];
t_B = [0.03 1.6 1.9 3.9]; %Times when B was measured (rows of B)
L_B = [2.1 2.9 2.98 5.0]; %Locations where B was measured (columns of B)
What I want is some code to transform these datasets into the following:
t = [0.03 1.6 1.9 3.9];
L = [1.0 2.1 2.9 2.98 4.2 5.0 6.33];
A_new = [0 NaN 0 1 2 NaN 4; 1 NaN 1 3 3 NaN 2; NaN NaN NaN NaN NaN NaN NaN; 4 NaN 4 1 0 NaN 3];
B_new = [NaN 10 13 10 NaN 10 NaN; NaN 15 13 13 NaN 12 NaN; NaN 14 14 13 NaN 12 NaN; NaN 15 19 11 NaN 13 NaN];
The new arrays, A_new and B_new, are the same size and the vectors t and L (corresponding to the rows and columns) are sequential. The original A had no data at t = 1.9 and thus at the 3rd row in A_new, there is all NaN values. Similarly for the columns 2 and 6 in A_new and columns 1, 5 and 7 in B_new.
How can I do this in MATLAB quickly for a large dataset?
Create a matrix of NaNs, use the third output of the unique function to convert the floating-point values to integer indices, and use matrix indexing to fill the matrices:
[t,~,it] = unique([t_A t_B]);
[L,~,iL] = unique([L_A L_B]);
A_new = NaN(numel(t),numel(L));
A_new(it(1:numel(t_A)),iL(1:numel(L_A))) = A;
B_new = NaN(numel(t),numel(L));
B_new(it(numel(t_A)+1:end),iL(numel(L_A)+1:end)) = B;
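
The same mapping can also be sketched with union and ismember, which some readers find easier to follow than slicing the index vectors returned by unique. This is a variant, not the answer's method, and it assumes that matching times and locations are exactly equal as floating-point numbers (measured values may need rounding to a tolerance first). The names rA, cA, rB and cB are illustrative.

```matlab
A   = [0 0 1 2 4; 1 1 3 3 2; 4 4 1 0 3];
t_A = [0.03 1.6 3.9];
L_A = [1.0 2.9 2.98 4.2 6.33];
B   = [10 13 10 10; 15 13 13 12; 14 14 13 12; 15 19 11 13];
t_B = [0.03 1.6 1.9 3.9];
L_B = [2.1 2.9 2.98 5.0];

% Sorted, duplicate-free axes shared by both data sets
t = union(t_A, t_B);
L = union(L_A, L_B);

% Row and column positions of each original axis value on the shared axes
[~, rA] = ismember(t_A, t);
[~, cA] = ismember(L_A, L);
[~, rB] = ismember(t_B, t);
[~, cB] = ismember(L_B, L);

% Start from all-NaN and drop each data set into its rows and columns
A_new = NaN(numel(t), numel(L));  A_new(rA, cA) = A;
B_new = NaN(numel(t), numel(L));  B_new(rB, cB) = B;
```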

in matlab put data into bins and calculate mean

In matlab, say I have the following data:
data = [4 0.1; 6 0.5; 3 0.8; 2 1.4; 7 1.6; 12 1.8; 9 1.9; 1 2.3; 5 2.5; 5 2.6];
I want to place the 1st column into bins according to elements in the 2nd column (i.e. 0-1, 1-2, 2-3...), and calculate the mean and 95% confidence interval of the elements in column 1 within that bin . So I'd have a matrix something like this:
mean lower_95% upper_95% bin
4.33 0
7.5 1
3.67 2
You can use accumarray with the appropriate function for the mean (mean) or the quantiles (quantile):
m = accumarray(floor(data(:,2))+1, data(:,1), [], @mean);
l = accumarray(floor(data(:,2))+1, data(:,1), [], @(x) quantile(x,.05));
u = accumarray(floor(data(:,2))+1, data(:,1), [], @(x) quantile(x,.95));
result = [m l u (0:numel(m)-1).'];
This can also be done calling accumarray once with cell array output:
result = accumarray(floor(data(:,2))+1, data(:,1), [],...
@(x) {[mean(x) quantile(x,.05) quantile(x,.95)]});
result = cell2mat(result);
For your example data,
result =
4.3333 3.0000 6.0000 0
7.5000 2.0000 12.0000 1.0000
3.6667 1.0000 5.0000 2.0000
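
To see what accumarray is grouping here, it can help to look at the subscripts on their own. The sketch below leaves out quantile (which, as far as I know, needs the Statistics Toolbox) and only computes the bin index, the per-bin mean and the per-bin count; bin_idx, bin_mean and bin_count are illustrative names.

```matlab
data = [4 0.1; 6 0.5; 3 0.8; 2 1.4; 7 1.6; 12 1.8; 9 1.9; 1 2.3; 5 2.5; 5 2.6];

% Bin number for every row: values in [0,1) -> 1, [1,2) -> 2, [2,3) -> 3
bin_idx = floor(data(:,2)) + 1;

% Mean of column 1 within each bin
bin_mean = accumarray(bin_idx, data(:,1), [], @mean);

% Number of rows that fell into each bin
bin_count = accumarray(bin_idx, 1);
```

Every row of data is routed to the output element named by its entry in bin_idx, and the function handle is applied to the values collected there.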
This outputs a matrix with the labelled columns. Note that for your example data, 2 standard deviations from the mean (for the 95% confidence interval) gives values outside of the bands. With a larger (normally distributed) data set, you wouldn't see this.
Your data:
data = [4 0.1; 6 0.5; 3 0.8; 2 1.4; 7 1.6; 12 1.8; 9 1.9; 1 2.3; 5 2.5; 5 2.6];
Binning for output table:
% Initialise output matrix. Columns:
% Mean, lower 95%, upper 95%, bin left, bin right
bins = [0 1; 1 2; 2 3];
out = zeros(size(bins,1),5);
% Cycle through bins
for ii = 1:size(bins,1)
% Store logical array of which elements fit in given bin
% You may want to include edge case for "greater than or equal to" leftmost bin.
% Alternatively you could make the left bin equal to "left bin - eps" = -eps
bin = data(:,2) > bins(ii,1) & data(:,2) <= bins(ii,2);
% Calculate mean, and mean +- 2*std deviation for confidence intervals
out(ii,1) = mean(data(bin,2));
out(ii,2) = out(ii,1) - 2*std(data(bin,2));
out(ii,3) = out(ii,1) + 2*std(data(bin,2));
end
% Append bins to the matrix
out(:,4:5) = bins;
Output:
out =
0.4667 -0.2357 1.1690 0 1.0000
1.6750 1.2315 2.1185 1.0000 2.0000
2.4667 2.1612 2.7722 2.0000 3.0000

Summing over rows of a matrix in Matlab with the same index

I have a matrix A in MATLAB of dimension h x k, where element (i,k), i.e. the last column, holds an index from {1,2,...,s} with s<=h. The indices can be repeated across rows. I want to obtain B of dimension s x (k-1), where row j is the sum of the rows of A(:,1:k-1) with index j. For example, if
A = [0.4 5 6 0.3 1;
0.6 -0.7 3 2 2;
0.3 4.5 6 8.9 1;
0.9 0.8 0.7 3 3;
0.7 0.8 0.9 0.5 2]
the result should be
B = [0.7 9.5 12 9.2;
1.3 0.1 3.9 2.5;
0.9 0.8 0.7 3]
You'd need a multi-column version of accumarray. Failing that, you can use sparse as follows:
[m n] = size(A);
rows = ceil(1/(n-1):1/(n-1):m);
cols = repmat(1:n-1,1,m);
B = full(sparse(A(rows,end), cols, A(:,1:end-1).'));
cell2mat(arrayfun(@(x) sum(A(A(:,end)==x,1:end-1),1), unique(A(:,end)), 'UniformOutput', false))
The key point is selecting rows A(A(:,end)==x,1:end-1) where x is a unique element of A(:,end)
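
Another way to sketch the grouped row sum is a sparse indicator matrix multiplied by the data columns. This is a variant on the answers above rather than a restatement of them; S and idx are illustrative names, and the index column is assumed to contain positive integers.

```matlab
A = [0.4  5    6   0.3 1;
     0.6 -0.7  3   2   2;
     0.3  4.5  6   8.9 1;
     0.9  0.8  0.7 3   3;
     0.7  0.8  0.9 0.5 2];

idx = A(:, end);          % group index of every row
h   = size(A, 1);

% S(j,i) = 1 when row i of A belongs to group j
S = sparse(idx, (1:h).', 1, max(idx), h);

% Multiplying by S adds up the rows of each group
B = full(S * A(:, 1:end-1));
```

Row j of B is the sum of all rows of A(:,1:end-1) whose last-column entry equals j.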

How to collect the indices of an array X that has the same lengths between elements

I am trying to build an array Z that holds the indices of the most frequently occurring difference between consecutive elements of the array X. So if the most frequent difference between two elements in X is 3, then Z should contain all the indices in X that have that difference.
x = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6];
ct = 0;
difference_x = diff(x);
unique_x = unique(difference_x);
for i = 1:length(unique_x)
for j = 1:length(x)
space_between_elements = abs(x(i)-x(i+1));
if space_between_elements == difference_x
ct = ct + 1;
space_set(i,ct) = j;
end
end
end
I don't get the indices of X containing the most frequent difference from this code.
It appears you want to find how many unique differences there are, with "difference" interpreted in an absolute-value sense, and also how many times each difference occurs.
You can do that as follows:
x = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6]; %// data
difference_x = abs(diff(x));
unique_x = unique(difference_x); %// all differences
counts = histc(difference_x, unique_x); %// count for each difference
However, comparing reals for uniqueness (or equality) is problematic because of finite precision. You should rather apply a tolerance to declare two values as "equal":
x = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6]; %// data
tol = 1e-6; %// tolerance
difference_x = abs(diff(x));
difference_x = round(difference_x/tol)*tol; %// apply tolerance
unique_x = unique(difference_x); %// all differences
counts = histc(difference_x, unique_x); %// count for each difference
With your example x, the second approach gives
>> unique_x
unique_x =
0 0.1000 0.2000 0.3000
>> counts
counts =
1 4 3 2
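
To get from these counts to the indices the question asks for, one possible sketch is shown below. It quantizes the differences to integer multiples of the tolerance (an assumption made here so that equality tests are exact) and uses unique plus accumarray instead of histc; Z(i) is taken to mean that x(Z(i)) and x(Z(i)+1) are separated by the most frequent difference.

```matlab
x   = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6];
tol = 1e-6;                               % tolerance

% Absolute differences, expressed as integer multiples of tol
q = round(abs(diff(x)) / tol);

% Distinct differences and how often each one occurs
[u, ~, ic] = unique(q);
counts = accumarray(ic(:), 1);

% Most frequent difference (the first one wins in case of a tie)
[~, k] = max(counts);
most_frequent = u(k) * tol;

% Positions i where |x(i+1) - x(i)| equals the most frequent difference
Z = find(q == u(k));
```

With the example x, the most frequent difference is 0.1 and Z lists the positions where a step of that size starts.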
