I've been fumbling with this for a while now, so let me start at the beginning.
This is what the first few lines of my data look like:
Year Action_mean_global Adventure_mean_global Fighting_mean_global Misc_mean_global Platform_mean_global
1 1980 0.3400000 NaN 0.77 0.6775 NaN
2 1981 0.5936000 NaN NaN NaN 2.3100
3 1982 0.3622222 NaN NaN 0.8700 1.0060
4 1983 0.4085714 0.4 NaN 2.1400 1.3860
5 1984 1.8500000 NaN NaN 1.4500 0.6900
6 1985 1.7600000 NaN 1.05 NaN 10.7925
Puzzle_mean_global Racing_mean_global Roleplaying_mean_global Shooter_mean_global Simulation_mean_global
1 NaN NaN NaN 3.53500 NaN
2 1.120000 0.480000 NaN 1.00400 0.45
3 3.343333 0.785000 NaN 0.75800 NaN
4 0.780000 NaN NaN 0.48000 NaN
5 1.046667 1.983333 NaN 10.36667 NaN
6 0.802500 NaN NaN 1.00000 0.03
Sports_mean_global Strategy_mean_global Total_mean_global
1 0.4900 NaN 1.2644444
2 0.1975 NaN 0.7776087
3 0.5250 NaN 0.8016667
4 3.2000 NaN 0.9876471
5 3.0900 NaN 3.5971429
6 1.9600 NaN 3.8528571
They are all numeric.
Now, I simply wanted to make a plot with ggplot() + geom_line() to visualize the change over the years per genre. It works when doing it step by step, i.e.:
ggplot(df)+
geom_line(aes_string(x = 'Year', y = plot_vector[1]))+
geom_line(aes_string(x = 'Year', y = plot_vector[2]))+
geom_line(aes_string(x = 'Year', y = plot_vector[3]))+
geom_line(aes_string(x = 'Year', y = plot_vector[4]))+
geom_line(aes_string(x = 'Year', y = plot_vector[5]))+
geom_line(aes_string(x = 'Year', y = plot_vector[6]))+
geom_line(aes_string(x = 'Year', y = plot_vector[7]))+
geom_line(aes_string(x = 'Year', y = plot_vector[8]))+
geom_line(aes_string(x = 'Year', y = plot_vector[9]))+
geom_line(aes_string(x = 'Year', y = plot_vector[10]))+
geom_line(aes_string(x = 'Year', y = plot_vector[11]))+
geom_line(aes_string(x = 'Year', y = plot_vector[12]))
(plot_vector simply contains all column names except for Year.)
However, doing it in a for-loop:
p1 <- ggplot(df)+
geom_line(aes_string(x = 'Year', y = plot_vector[1]))
for (plotnumber in 2:length(plot_vector))
{
p1 <- p1 +
geom_line(aes_string(x = 'Year', y = plot_vector[plotnumber]))
}
I get an error message. Can anyone offer an idea?
Adding lines to a ggplot object with a for loop may well have caused the reported error message, but it also has a general problem caused by lazy evaluation: aes_string() is only evaluated when the plot is printed, by which time the loop variable holds its final value, so all layers end up plotting the same column. This has been asked frequently on SO; see, e.g., “for” loop only adds the final ggplot layer, or ggplot loop adding curves fails, but works one at a time.
However, ggplot2 works best when data are supplied in long format. Here, melt() from the data.table package is used to reshape df:
library(data.table)
molten <- melt(setDT(df), id.vars = c("Year"))
library(ggplot2)
ggplot(molten, aes(x = Year, y = value, group = variable, colour = variable)) +
geom_line()
This creates the following chart:
Data
df <- structure(list(Year = 1980:1985, Action_mean_global = c(0.34,
0.5936, 0.3622222, 0.4085714, 1.85, 1.76), Adventure_mean_global = c(NaN,
NaN, NaN, 0.4, NaN, NaN), Fighting_mean_global = c(0.77, NaN,
NaN, NaN, NaN, 1.05), Misc_mean_global = c(0.6775, NaN, 0.87,
2.14, 1.45, NaN), Platform_mean_global = c(NaN, 2.31, 1.006,
1.386, 0.69, 10.7925), Puzzle_mean_global = c(NaN, 1.12, 3.343333,
0.78, 1.046667, 0.8025), Racing_mean_global = c(NaN, 0.48, 0.785,
NaN, 1.983333, NaN), Roleplaying_mean_global = c(NaN, NaN, NaN,
NaN, NaN, NaN), Shooter_mean_global = c(3.535, 1.004, 0.758,
0.48, 10.36667, 1), Simulation_mean_global = c(NaN, 0.45, NaN,
NaN, NaN, 0.03), Sports_mean_global = c(0.49, 0.1975, 0.525,
3.2, 3.09, 1.96), Strategy_mean_global = c(NaN, NaN, NaN, NaN,
NaN, NaN), Total_mean_global = c(1.2644444, 0.7776087, 0.8016667,
0.9876471, 3.5971429, 3.8528571)), .Names = c("Year", "Action_mean_global",
"Adventure_mean_global", "Fighting_mean_global", "Misc_mean_global",
"Platform_mean_global", "Puzzle_mean_global", "Racing_mean_global",
"Roleplaying_mean_global", "Shooter_mean_global", "Simulation_mean_global",
"Sports_mean_global", "Strategy_mean_global", "Total_mean_global"
), class = "data.frame", row.names = c(NA, -6L))
Related
I am pivoting several data frames. Some are pivoted correctly; others are not.
I have 5 frames with the same structure (data acquired from a PLC: basically timestamps, variable name, and corresponding value).
The shapes of the data frames are:
(85247, 5), (255737, 5), (255734, 5), (574065, 5), (567587, 5)
The structure of the pivoted frame has timestamp as index and columns with values (ID and quality are dropped before pivoting).
Out of the 5 data frames, 3 are pivoted correctly and 2 are filled with far too many NaN values. The shapes after pivoting are:
(85247, 1), (85258, 3), (85258, 3), (85542, 84), (85216, 13)
The code is this one:
df_WATER['TIMESTAMP'] = pd.to_datetime(df_WATER['TIMESTAMP'], errors='ignore')
df_HSD['TIMESTAMP'] = pd.to_datetime(df_HSD['TIMESTAMP'], errors='ignore')
df_HSCPLC3['TIMESTAMP'] = pd.to_datetime(df_HSCPLC3['TIMESTAMP'], errors='ignore')
df_HSCPLC2ACT['TIMESTAMP'] = pd.to_datetime(df_HSCPLC2ACT['TIMESTAMP'], errors='ignore')
df_FURNACE['TIMESTAMP'] = pd.to_datetime(df_FURNACE['TIMESTAMP'], errors='ignore')
df_WATER = df_WATER.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_HSD = df_HSD.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_HSCPLC3 = df_HSCPLC3.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_HSCPLC2ACT = df_HSCPLC2ACT.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
df_FURNACE = df_FURNACE.pivot_table(index='TIMESTAMP', columns='NAME', values='VALUE')
It happens with the data frames whose number of columns exceeds 10 ((85542, 84) and (85216, 13)). I wonder whether this may be a limit of the function.
TIMESTAMP
2022-11-18 20:00:00.224 NaN
2022-11-18 20:00:00.731 NaN
2022-11-18 20:00:01.240 NaN
2022-11-18 20:00:01.751 NaN
2022-11-18 20:00:02.259 NaN
... ...
2022-11-19 07:59:57.906 NaN
2022-11-19 07:59:58.411 NaN
2022-11-19 07:59:58.920 NaN
2022-11-19 07:59:59.420 NaN
2022-11-19 07:59:59.927 NaN
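For what it's worth, pivot_table has no limit on the number of columns. The NaNs usually appear because the pivoted frame's index is the union of all timestamps, and each NAME typically only has a value at its own timestamps; the more distinct names there are, the sparser the result. A minimal sketch with hypothetical data reproducing the effect:

```python
import pandas as pd

# Two variables sampled at different timestamps, as a PLC log might be.
df = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime([
        "2022-11-18 20:00:00.224", "2022-11-18 20:00:00.731",
        "2022-11-18 20:00:01.240", "2022-11-18 20:00:01.751",
    ]),
    "NAME": ["temp", "temp", "pressure", "pressure"],
    "VALUE": [1.0, 2.0, 3.0, 4.0],
})

wide = df.pivot_table(index="TIMESTAMP", columns="NAME", values="VALUE")
# Each column only has values at its own timestamps; the rest are NaN.
print(wide.isna().sum())
```

If the variables are sampled on slightly different clocks, rounding the timestamps before pivoting (e.g. df["TIMESTAMP"].dt.round("500ms")) collapses near-identical rows and removes most of the NaNs.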
I want to export a single date and the variable fwi of an xarray dataset ds to a tif file. However, my data array has too many dimensions and I cannot figure out how to remove the dimension lsm_time.
ds
<xarray.Dataset>
Dimensions: (time: 2436, y: 28, x: 58, lsm_time: 1)
Coordinates:
* time (time) datetime64[ns] 2015-05-02T12:00:00 ... 2021-12-31T12:...
step <U8 '12:00:00'
surface float64 0.0
* y (y) float64 36.01 36.11 36.21 36.31 ... 38.41 38.51 38.61 38.71
* x (x) float64 -7.64 -7.54 -7.44 -7.34 ... -2.24 -2.14 -2.04 -1.94
spatial_ref int32 0
Dimensions without coordinates: lsm_time
Data variables:
ffmc (time, y, x, lsm_time) float32 nan nan nan ... 88.93 88.53
dmc (time, y, x, lsm_time) float32 nan nan nan ... 6.511 7.908 8.45
dc (time, y, x, lsm_time) float32 nan nan nan ... 406.2 428.5
isi (time, y, x, lsm_time) float32 nan nan nan ... 3.872 3.852
bui (time, y, x, lsm_time) float32 nan nan nan ... 15.08 16.11
fwi (time, y, x, lsm_time) float32 nan nan nan ... 5.303 5.486
Exporting the dataarray raises the error TooManyDimensions:
ds.fwi.sel(time="07-14-2021").rio.to_raster("file_out.tif")
raise TooManyDimensions(
rioxarray.exceptions.TooManyDimensions: Only 2D and 3D data arrays supported. Data variable: fwi
I already dropped the dimension lsm_time in a previous step, when I masked the dataset ds with a land-sea mask lsm (a single date) and had to duplicate/add the time dimension of the dataset ds. So maybe this error results from that data handling. However, I could not figure out how to do the masking otherwise.
lsm
Dimensions: (x: 58, y: 28, time: 1)
Coordinates:
* x (x) float64 -7.64 -7.54 -7.44 -7.34 ... -2.24 -2.14 -2.04 -1.94
* y (y) float64 36.01 36.11 36.21 36.31 ... 38.41 38.51 38.61 38.71
* time (time) datetime64[ns] 2013-11-29
Data variables:
spatial_ref int32 0
lsm (time, y, x) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 0.9996
lsm = lsm.expand_dims({"new_time" : ds.time})
lsm = lsm.rename({"time":"lsm_time"}).rename({"new_time" : "time"}).drop("lsm_time") #issue here: drop_dims removes variable..
ds = ds.where(lsm>=threshold)
So I already applied .drop("lsm_time"), but the dimension is still there:
print(ds.fwi.sel(time="07-14-2021").dims)
> ('time', 'y', 'x', 'lsm_time')
When trying .drop_dims, it removes all data variables.
ds.drop_dims('lsm_time')
<xarray.Dataset>
Dimensions: (time: 2436, y: 28, x: 58)
Coordinates:
* time (time) datetime64[ns] 2015-05-02T12:00:00 ... 2021-12-31T12:...
step <U8 '12:00:00'
surface float64 0.0
* y (y) float64 36.01 36.11 36.21 36.31 ... 38.41 38.51 38.61 38.71
* x (x) float64 -7.64 -7.54 -7.44 -7.34 ... -2.24 -2.14 -2.04 -1.94
spatial_ref int32 0
Data variables:
*empty*
What am I missing or what did I do wrong? Thanks for any help!
Dimensions with size 1 can be removed using the .squeeze method.
Conversely, you can add a dimension of size 1 using .expand_dims
import xarray as xr
x = xr.tutorial.load_dataset("rasm")
y = x.isel(time=slice(0,1))
y.dims
# Frozen({'time': 1, 'y': 205, 'x': 275})
y.squeeze().dims
# Frozen({'y': 205, 'x': 275})
y.squeeze().expand_dims("newdim").dims
# Frozen({'y': 205, 'x': 275, 'newdim': 1})
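Applied to the question's data, squeezing out the size-1 lsm_time dimension should leave a 2D array that rio.to_raster accepts. A sketch on a small stand-in dataset (the dataset here is hypothetical; the rioxarray call itself is omitted):

```python
import numpy as np
import xarray as xr

# Stand-in for the question's ds: fwi carries a size-1 lsm_time dimension.
ds = xr.Dataset(
    {"fwi": (("time", "y", "x", "lsm_time"), np.zeros((3, 2, 2, 1)))}
)

# Select one date, then drop the length-1 dimension.
da = ds["fwi"].isel(time=0).squeeze("lsm_time", drop=True)
print(da.dims)  # now 2D: ('y', 'x')
```

With lsm_time gone, something like ds.fwi.sel(time="07-14-2021").squeeze("lsm_time", drop=True).rio.to_raster("file_out.tif") should no longer raise TooManyDimensions.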
I have a matrix A, which is m x n. What I want to do is count the number of NaN elements in each row. If the number of NaN elements is greater than or equal to some arbitrary threshold, then all the values in that row will be set to NaN.
num_obs = sum(isnan(rets), 2);
index = num_obs >= min_obs;
As I say, I am struggling to get my brain to work. I have been trying different variations of the line below, but no luck.
rets(index==0, :) = rets(index==0, :) .* NaN;
The example data for threshold >= 1 is:
A = [-7 -8 1.6 11.9;
NaN NaN NaN NaN;
5.5 6.3 2.1 NaN;
5.5 4.2 2.2 5.6;
NaN NaN NaN NaN];
and the result I want is:
A = [-7 -8 1.6 11.9;
NaN NaN NaN NaN;
NaN NaN NaN NaN;
5.5 4.2 2.2 5.6;
NaN NaN NaN NaN];
Use
A = magic(4);A(3,3)=nan;
threshold=1;
for ii = 1:size(A,1) % loop over rows
if sum(isnan(A(ii,:)))>=threshold % get the nans, sum the occurances
A(ii,:)=nan(1,size(A,2)); % fill the row with column width amount of nans
end
end
Results in
A =
16 2 3 13
5 11 10 8
NaN NaN NaN NaN
4 14 15 1
Or, as @Obchardon mentioned in his comment, you can vectorise:
A(sum(isnan(A),2)>=threshold,:) = NaN
A =
16 2 3 13
5 11 10 8
NaN NaN NaN NaN
4 14 15 1
As a side-note you can easily change this to columns, simply do all indexing for the other dimension:
A(:,sum(isnan(A),1)>=threshold) = NaN;
Instead of the isnan function, you can use A ~= A to find NaN elements, since NaN is the only value that does not equal itself.
A(sum((A ~= A),2) >= t,:) = NaN
where t is the threshold for the minimum number of existing NaN elements.
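For comparison, the same vectorised row-masking translates directly to Python/NumPy (a sketch, not part of the original answers):

```python
import numpy as np

A = np.array([[-7.0, -8.0, 1.6, 11.9],
              [np.nan, np.nan, np.nan, np.nan],
              [5.5, 6.3, 2.1, np.nan],
              [5.5, 4.2, 2.2, 5.6],
              [np.nan, np.nan, np.nan, np.nan]])
threshold = 1

# Rows with at least `threshold` NaNs are overwritten entirely with NaN.
A[np.isnan(A).sum(axis=1) >= threshold, :] = np.nan
print(A)
```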
I have one dataset in which some timestamps are missing. I have written the code below so far:
x = table2dataset(Testing_data);
T1 = x(:,1);
C1 =dataset2cell(T1);
formatIn = 'yyyy-mm-dd HH:MM:SS';
t1= datenum(C1,formatIn);
% Creating 10 minutes of time interval;
avg = 10/60/24;
tnew = [t1(1):avg:t1(end)]';
indx = round((t1-t1(1))/avg) + 1;
ynew = NaN(length(tnew),1);
ynew(indx)=t1;
% replacing missing time with NaN
t = datetime(ynew,'ConvertFrom','datenum');
formatIn = 'yyyy-mm-dd HH:MM:SS';
DateVector = datevec(ynew,formatIn);
dt = datestr(ynew,'yyyy-mm-dd HH:MM:SS');
ds = string(dt);
The testing data has three parameters shown here,
Time x y
2009-04-10 02:00:00.000 1 0.1
2009-04-10 02:10:00.000 2 0.2
2009-04-10 02:30:00.000 3 0.3
2009-04-10 02:50:00.000 4 0.4
Now as you can see, for intervals of 10 minutes, there are missing timestamps (2:20 and 2:40), so I want to add those timestamps and set the corresponding x and y values to NaN. My output would then look like:
Time x y
2009-04-10 02:00:00.000 1 0.1
2009-04-10 02:10:00.000 2 0.2
2009-04-10 02:20:00.000 NaN NaN
2009-04-10 02:30:00.000 3 0.3
2009-04-10 02:40:00.000 NaN NaN
2009-04-10 02:50:00.000 4 0.4
As you can see from my code, I am just able to add NaN in place of the missing timestamps, but I would now like the corresponding x and y values to come out as NaN as well.
Please note I have more than 3000 data rows in the above format; I want to do the same for all my values.
There seems to be a contradiction in your question: you say that you are able to insert NaN in place of the missing time string but, in the example of the expected output, you wrote the time string.
Also, you refer to the missing timestamp (2:20) but, if the time step is 10 minutes, your example data contains another missing timestamp (2:40).
Assuming that:
you actually want to insert the missing time string
you want to manage all the missing timestamps
you could modify your code as follows:
the ynew time is not needed
the tnew time should be used in place of ynew
to insert the NaN values in the x and y columns you have to:
extract them from the dataset
create two new arrays initialized to NaN
insert the original x and y data at the locations identified by indx
In the following you can find an updated version of your code.
the x and y data are stored in the x_data and y_data arrays
the new x and y data are stored in the x_data_new and y_data_new arrays
at the end of the script, two tables are generated: the first one uses the time as a char array, the second one as a cell array of strings.
The comments in the code should identify the modifications.
x = table2dataset(Testing_data);
T1 = x(:,1);
% Get X data from the table
x_data=x(:,2)
% Get Y data from the table
y_data=x(:,3)
C1 =dataset2cell(T1);
formatIn = 'yyyy-mm-dd HH:MM:SS';
t1= datenum(C1(2:end),formatIn)
avg = 10/60/24; % Creating 10 minutes of time interval;
tnew = [t1(1):avg:t1(end)]'
indx = round((t1-t1(1))/avg) + 1
%
% Not Needed
%
% ynew = NaN(length(tnew),1);
% ynew(indx)=t1;
%
% Create the new X and Y data
%
x_data_new=nan(length(tnew),1)
x_data_new(indx)=x_data
y_data_new=nan(length(tnew),1)
y_data_new(indx)=y_data
% t = datetime(ynew,'ConvertFrom','datenum') % replacing missing time with NAN
%
% Use tnew instead of ynew
%
t = datetime(tnew,'ConvertFrom','datenum') % replacing missing time with NAN
formatIn = 'yyyy-mm-dd HH:MM:SS'
% DateVector = datevec(y_data_new,formatIn)
% dt = datestr(ynew,'yyyy-mm-dd HH:MM:SS')
%
% Use tnew instead of ynew
%
dt = datestr(tnew,'yyyy-mm-dd HH:MM:SS')
% ds = char(dt)
new_table=table(dt,x_data_new,y_data_new)
new_table_1=table(cellstr(dt),x_data_new,y_data_new)
The output is
new_table =
dt x_data_new y_data_new
___________ __________ __________
[1x19 char] 1 0.1
[1x19 char] 2 0.2
[1x19 char] NaN NaN
[1x19 char] 3 0.3
[1x19 char] NaN NaN
[1x19 char] 4 0.4
new_table_1 =
Var1 x_data_new y_data_new
_____________________ __________ __________
'2009-04-10 02:00:00' 1 0.1
'2009-04-10 02:10:00' 2 0.2
'2009-04-10 02:20:00' NaN NaN
'2009-04-10 02:30:00' 3 0.3
'2009-04-10 02:40:00' NaN NaN
'2009-04-10 02:50:00' 4 0.4
Hope this helps.
Qapla'
This example is not too different from the accepted answer, but IMHO it is a bit easier on the eyes. It also supports gaps larger than 1 step, and is a bit more generic because it makes fewer assumptions.
It works with plain cell arrays instead of the original table data, so that conversion is up to you (I'm on R2010a, so I can't test it).
% Example data with intentional gaps of varying size
old_data = {'2009-04-10 02:00:00.000' 1 0.1
'2009-04-10 02:10:00.000' 2 0.2
'2009-04-10 02:30:00.000' 3 0.3
'2009-04-10 02:50:00.000' 4 0.4
'2009-04-10 03:10:00.000' 5 0.5
'2009-04-10 03:20:00.000' 6 0.6
'2009-04-10 03:50:00.000' 7 0.7}
% Convert textual dates to numbers we can work with more easily
old_dates = datenum(old_data(:,1));
% Nominal step size is the minimum of all differences
deltas = diff(old_dates);
nominal_step = min(deltas);
% Generate new date numbers with constant step
new_dates = old_dates(1) : nominal_step : old_dates(end);
% Determine where the gaps in the data are, and how big they are,
% taking into account rounding error
step_gaps = abs(deltas - nominal_step) > 10*eps;
gap_sizes = round( deltas(step_gaps) / nominal_step - 1);
% Create new data structure with constant-step time stamps,
% initially with the data of interest all-NAN
new_size = size(old_data,1) + sum(gap_sizes);
new_data = [cellstr( datestr(new_dates, 'yyyy-mm-dd HH:MM:SS') ),...
repmat({NaN}, new_size, 2)];
% Compute proper locations of the old data in the new data structure,
% again, taking into account rounding error
day = 86400; % (seconds in a day)
new_datapoint = ismember(round(new_dates * day), ...
round(old_dates * day));
% Insert the old data at the right locations
new_data(new_datapoint, 2:3) = old_data(:, 2:3)
Output is:
old_data =
'2009-04-10 02:00:00.000' [1] [0.100000000000000]
'2009-04-10 02:10:00.000' [2] [0.200000000000000]
'2009-04-10 02:30:00.000' [3] [0.300000000000000]
'2009-04-10 02:50:00.000' [4] [0.400000000000000]
'2009-04-10 03:10:00.000' [5] [0.500000000000000]
'2009-04-10 03:20:00.000' [6] [0.600000000000000]
'2009-04-10 03:50:00.000' [7] [0.700000000000000]
new_data =
'2009-04-10 02:00:00' [ 1] [0.100000000000000]
'2009-04-10 02:10:00' [ 2] [0.200000000000000]
'2009-04-10 02:20:00' [NaN] [ NaN]
'2009-04-10 02:30:00' [ 3] [0.300000000000000]
'2009-04-10 02:40:00' [NaN] [ NaN]
'2009-04-10 02:50:00' [ 4] [0.400000000000000]
'2009-04-10 03:00:00' [NaN] [ NaN]
'2009-04-10 03:10:00' [ 5] [0.500000000000000]
'2009-04-10 03:20:00' [ 6] [0.600000000000000]
'2009-04-10 03:30:00' [NaN] [ NaN]
'2009-04-10 03:40:00' [NaN] [ NaN]
'2009-04-10 03:50:00' [ 7] [0.700000000000000]
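As an aside, for readers doing the same task in Python, pandas can regularise the timestamps in one step with asfreq; missing rows get NaN automatically (a sketch using the question's example data):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["2009-04-10 02:00:00", "2009-04-10 02:10:00",
                            "2009-04-10 02:30:00", "2009-04-10 02:50:00"]),
    "x": [1, 2, 3, 4],
    "y": [0.1, 0.2, 0.3, 0.4],
})

# Reindex onto a regular 10-minute grid; rows for missing timestamps
# are created with NaN in every data column.
out = df.set_index("Time").asfreq("10min").reset_index()
print(out)
```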
I have the following matrix
xx = [ 1 2 3 4; NaN NaN 7 8];
I want to change xx to be:
yy = [ NaN NaN NaN NaN; 88 88 NaN NaN];
I have the following script
for i = 1:2;
for j = 1:4;
if (xx(i,j) ~= NaN)
yy(i,j) = NaN;
else
yy(i,j) = 88;
end
end
end
xx
yy
But I get an unwanted result:
yy =
NaN NaN NaN NaN
NaN NaN NaN NaN
Thanks a lot for your help
No need for loops. Just use logical indexing:
yy = xx; % initialize yy to xx
ind = isnan(xx); % logical index of NaN values in xx
yy(ind) = 88; % replace NaN with 88
yy(~ind) = NaN; % replace numbers with NaN
Anyway, the problem with your code is that xx(i,j) ~= NaN always gives true. NaN doesn't equal anything, by definition. To check if a value is NaN you need the isnan function. So you should use ~isnan(xx(i,j)) in your code:
for i = 1:2;
for j = 1:4;
if ~isnan(xx(i,j))
yy(i,j) = NaN;
else
yy(i,j) = 88;
end
end
end
Also, consider preallocating yy for speed. For example, you could initialize yy with all entries equal to 88, and then you can remove the else branch:
yy = repmat(88, size(xx));
for i = 1:2;
for j = 1:4;
if ~isnan(xx(i,j))
yy(i,j) = NaN;
end
end
end
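The same NaN-swap is a one-liner in Python/NumPy with np.where, for comparison (a sketch, not from the original answers):

```python
import numpy as np

xx = np.array([[1.0, 2, 3, 4],
               [np.nan, np.nan, 7, 8]])

# Where xx is NaN write 88; everywhere else write NaN.
yy = np.where(np.isnan(xx), 88.0, np.nan)
print(yy)
```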