I'm trying to extract the forecast residuals using the fabletools package. I know that I can extract the fitted-model residuals with the augment() function, but I don't know how that works for the forecast values: when I try, I get the same results as the fitted-model residuals. Here is an example:
library(fable)
library(tsibble)
lung_deaths <- as_tsibble(cbind(mdeaths, fdeaths))
## fitted model residuals
lung_deaths %>%
  dplyr::filter(index < yearmonth("1979 Jan")) %>%
  model(
    ets = ETS(value ~ error("M") + trend("A") + season("A"))
  ) %>%
  augment()
# A tsibble: 120 x 7 [1M]
# Key: key, .model [2]
key .model index value .fitted .resid .innov
<chr> <chr> <mth> <dbl> <dbl> <dbl> <dbl>
1 fdeaths ets 1974 Jan 901 837. 64.0 0.0765
2 fdeaths ets 1974 Feb 689 877. -188. -0.214
3 fdeaths ets 1974 Mar 827 795. 31.7 0.0399
4 fdeaths ets 1974 Apr 677 624. 53.2 0.0852
5 fdeaths ets 1974 May 522 515. 7.38 0.0144
6 fdeaths ets 1974 Jun 406 453. -47.0 -0.104
7 fdeaths ets 1974 Jul 441 431. 9.60 0.0223
8 fdeaths ets 1974 Aug 393 388. 4.96 0.0128
9 fdeaths ets 1974 Sep 387 384. 2.57 0.00668
10 fdeaths ets 1974 Oct 582 480. 102. 0.212
# ... with 110 more rows
## forecast residuals
test <- lung_deaths %>%
  dplyr::filter(index < yearmonth("1979 Jan")) %>%
  model(
    ets = ETS(value ~ error("M") + trend("A") + season("A"))
  ) %>%
  forecast(h = "1 year")
## defining newdata
Data <- lung_deaths %>%
  dplyr::filter(index >= yearmonth("1979 Jan"))
augment(test, newdata = Data, type.predict = 'response')
# A tsibble: 120 x 7 [1M]
# Key: key, .model [2]
key .model index value .fitted .resid .innov
<chr> <chr> <mth> <dbl> <dbl> <dbl> <dbl>
1 fdeaths ets 1974 Jan 901 837. 64.0 0.0765
2 fdeaths ets 1974 Feb 689 877. -188. -0.214
3 fdeaths ets 1974 Mar 827 795. 31.7 0.0399
4 fdeaths ets 1974 Apr 677 624. 53.2 0.0852
5 fdeaths ets 1974 May 522 515. 7.38 0.0144
6 fdeaths ets 1974 Jun 406 453. -47.0 -0.104
7 fdeaths ets 1974 Jul 441 431. 9.60 0.0223
8 fdeaths ets 1974 Aug 393 388. 4.96 0.0128
9 fdeaths ets 1974 Sep 387 384. 2.57 0.00668
10 fdeaths ets 1974 Oct 582 480. 102. 0.212
# ... with 110 more rows
Any suggestions would be greatly appreciated.
I think you probably want forecast errors --- the difference between what is observed and what was predicted. See https://otexts.com/fpp3/accuracy.html for a discussion. To quote that chapter:
Note that forecast errors are different from residuals in two ways. First, residuals are calculated on the training set while forecast errors are calculated on the test set. Second, residuals are based on one-step forecasts while forecast errors can involve multi-step forecasts.
Here is some code to compute forecast errors on your example.
library(fable)
library(tsibble)
library(dplyr)
lung_deaths <- as_tsibble(cbind(mdeaths, fdeaths))
## forecasts
fcast <- lung_deaths %>%
  dplyr::filter(index < yearmonth("1979 Jan")) %>%
  model(
    ets = ETS(value ~ error("M") + trend("A") + season("A"))
  ) %>%
  forecast(h = "1 year")
## defining newdata
new_data <- lung_deaths %>%
  dplyr::filter(index >= yearmonth("1979 Jan")) %>%
  rename(actual = value)
# Compute forecast errors
fcast %>%
  left_join(new_data) %>%
  mutate(error = actual - .mean)
#> Joining, by = c("key", "index")
#> # A fable: 24 x 7 [1M]
#> # Key: key, .model [2]
#> key .model index value .mean actual error
#> <chr> <chr> <mth> <dist> <dbl> <dbl> <dbl>
#> 1 fdeaths ets 1979 Jan N(783, 8522) 783. 821 37.5
#> 2 fdeaths ets 1979 Feb N(823, 9412) 823. 785 -38.4
#> 3 fdeaths ets 1979 Mar N(742, 7639) 742. 727 -14.8
#> 4 fdeaths ets 1979 Apr N(570, 4516) 570. 612 41.7
#> 5 fdeaths ets 1979 May N(461, 2951) 461. 478 16.9
#> 6 fdeaths ets 1979 Jun N(400, 2216) 400. 429 29.5
#> 7 fdeaths ets 1979 Jul N(378, 1982) 378. 405 27.1
#> 8 fdeaths ets 1979 Aug N(335, 1553) 335. 379 44.5
#> 9 fdeaths ets 1979 Sep N(331, 1520) 331. 393 62.1
#> 10 fdeaths ets 1979 Oct N(427, 2527) 427. 411 -15.7
#> # … with 14 more rows
Created on 2020-11-03 by the reprex package (v0.3.0)
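If you only need summary accuracy measures rather than the individual errors, fabletools also provides accuracy(); a minimal sketch, reusing fcast and lung_deaths from above:
fcast %>% accuracy(lung_deaths)
This returns one row per key and .model with ME, RMSE, MAE, MAPE and friends, computed over the test period.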
I have a data source with daily sales per product.
I want to create a field that calculates the average daily sales over the last 7 days, for each product and day (e.g. on day 10 for product A, it will give me the average sales of product A on days 3 - 9; on day 15 for product B, I'll see the average sales of B on days 8 - 14).
Is this possible?
Example data (I have the first 3 columns and need to generate the fourth):
Date Product Sales 7-Day Average
1/11 A 983 201
2/11 A 650 983
3/11 A 328 817
4/11 A 728 654
5/11 A 246 672
6/11 A 613 587
7/11 A 575 591
8/11 A 601 589
9/11 A 462 534
10/11 A 979 508
11/11 A 148 601
12/11 A 238 518
13/11 A 53 517
14/11 A 500 437
15/11 A 684 426
16/11 A 261 438
17/11 A 69 409
18/11 A 159 279
19/11 A 964 281
20/11 A 429 384
21/11 A 731 438
1/11 B 790 471
2/11 B 265 486
3/11 B 94 487
4/11 B 66 490
5/11 B 124 477
6/11 B 555 357
7/11 B 190 375
8/11 B 232 298
9/11 B 747 218
10/11 B 557 287
11/11 B 432 353
12/11 B 526 405
13/11 B 690 463
14/11 B 350 482
15/11 B 512 505
16/11 B 273 545
17/11 B 679 477
18/11 B 164 495
19/11 B 799 456
20/11 B 749 495
21/11 B 391 504
I haven't really tried anything yet; I couldn't figure out how to get started with this.
This may not be the perfect solution, but it does give your expected result in a crude way.
Cross-join the same data source with itself first, as shown in the screenshot.
Then use a calculated field to get the last-7-day average:
(CASE WHEN Date (Table 2) BETWEEN DATETIME_SUB(Date (Table 1), INTERVAL 7 DAY) AND DATETIME_SUB(Date (Table 1), INTERVAL 1 DAY) THEN Sales (Table 2) ELSE 0 END)/7
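For comparison, the same trailing 7-day average is straightforward in R; a minimal dplyr sketch, assuming a hypothetical sales data frame with one row per Date and Product (columns Date, Product, Sales as in the example, with Date as a proper date column):
library(dplyr)
sales %>%
  arrange(Product, Date) %>%
  group_by(Product) %>%
  mutate(avg.7day = sapply(seq_along(Sales), function(i) {
    # mean of the previous 7 days' sales (rows i-7 .. i-1); NA on the first day
    if (i == 1) NA_real_ else mean(Sales[max(1, i - 7):(i - 1)])
  })) %>%
  ungroup()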
I wanted to know what is the best way to rename multiple columns after using nest() and map().
I have a sample code of what I have done at the moment to achieve this using the iris dataset.
iris_names <- colnames(iris[, 1:4])
iris_sqrt <- iris %>%
  nest(-Species) %>%
  mutate(square_root = map(data, sqrt)) %>%
  unnest(square_root)
names(iris_sqrt)[3:ncol(iris_sqrt)] <- paste0(iris_names, ".sd")
Here, I make a vector of the column names I want to rename before creating iris_sqrt, and then rename using paste0. The drawback of this method is that the columns have to appear in the same order as the iris_names vector for the renaming to be correct.
Is there a tidyverse/dplyr way of doing this?
You can use mutate inside the map call. Use across() on everything() to apply sqrt, use the .names argument to name the new columns, and use .keep = "unused" in mutate to drop the columns that were used in the calculation:
iris %>%
  nest(data = -Species) %>%
  mutate(square_root = map(data, ~ .x %>%
    mutate(across(everything(), sqrt, .names = "{.col}.sd"),
           .keep = "unused"))) %>%
  unnest(square_root)
output
# A tibble: 150 × 6
Species data Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd
<fct> <list> <dbl> <dbl> <dbl> <dbl>
1 setosa <tibble [50 × 4]> 2.26 1.87 1.18 0.447
2 setosa <tibble [50 × 4]> 2.21 1.73 1.18 0.447
3 setosa <tibble [50 × 4]> 2.17 1.79 1.14 0.447
4 setosa <tibble [50 × 4]> 2.14 1.76 1.22 0.447
5 setosa <tibble [50 × 4]> 2.24 1.90 1.18 0.447
6 setosa <tibble [50 × 4]> 2.32 1.97 1.30 0.632
7 setosa <tibble [50 × 4]> 2.14 1.84 1.18 0.548
8 setosa <tibble [50 × 4]> 2.24 1.84 1.22 0.447
9 setosa <tibble [50 × 4]> 2.10 1.70 1.18 0.447
10 setosa <tibble [50 × 4]> 2.21 1.76 1.22 0.316
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
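If you don't actually need the data list-column afterwards, the same renamed columns can be obtained without nest()/map() at all, since sqrt() is applied element-wise anyway; a minimal sketch:
iris %>%
  mutate(across(-Species, sqrt, .names = "{.col}.sd"), .keep = "unused")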
I don't know how to use the read.table command when the data I want to read has some columns containing character strings.
I have a .dat file that contains 28 columns and 100 rows.
Año Mes Día Hora Min SO2 NOx CO O3 PM10 PM2.5 VelV DirV Temp SO2_MH NOx_MH CO_MH O3_MH PM10_MH PM2.5_MH Pred_SO2 Pred_NOx PredBin_SO2 PredBin_NOx CodM_SO2 CodM_NOx Mensaje_SO2 Mensaje_NOx
2018 5 15 16 38 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 15 16 39 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 16 11 29 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma
2018 5 16 11 30 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma
When I try to read the data, the first 26 columns come out fine, but the 27th and 28th are read as "No" and "hay"; I want the full sentence read into the 27th column, and likewise into the 28th.
This is what I use
min <- read.table("min.dat",header=T, fill = TRUE)
But I suppose I have to use the quote parameter somehow...
(I use fill=TRUE because some of these character strings are blank.)
You can do this using readr::read_fwf() if you can specify the start and end positions of each column:
library(readr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
fname <- 'sample.txt'
write_file(
' Año Mes Día Hora Min SO2 NOx CO O3 PM10 PM2.5 VelV DirV Temp SO2_MH NOx_MH CO_MH O3_MH PM10_MH PM2.5_MH Pred_SO2 Pred_NOx PredBin_SO2 PredBin_NOx CodM_SO2 CodM_NOx Mensaje_SO2 Mensaje_NOx
2018 5 15 16 38 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 15 16 39 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 16 11 29 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma
2018 5 16 11 30 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma ',
fname
)
hdr <- read_lines(fname, n_max = 1)
cnames <- hdr %>%
  trimws() %>%
  strsplit('\\s+') %>%
  unlist()
m <- gregexpr('\\S(?=\\s|$)', hdr, perl = TRUE) # find the end position of each column
epos <- unlist(m)
spos <- lag(epos + 1, 1, default = 1)
read_fwf(fname, fwf_positions(start = spos, end = epos, col_names = cnames), skip = 1)
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> Mensaje_SO2 = col_character(),
#> Mensaje_NOx = col_character()
#> )
#> See spec(...) for full column specifications.
#> # A tibble: 4 x 28
#> Año Mes Día Hora Min SO2 NOx CO O3 PM10 PM2.5 VelV
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018 5 15 16 38 -1 -1 -1 -1 -1 -1 -1
#> 2 2018 5 15 16 39 -1 -1 -1 -1 -1 -1 -1
#> 3 2018 5 16 11 29 4.15 7.51 0.33 77 13 5 1.13
#> 4 2018 5 16 11 30 4.15 7.51 0.33 77 13 5 1.13
#> # … with 16 more variables: DirV <dbl>, Temp <dbl>, SO2_MH <dbl>,
#> # NOx_MH <dbl>, CO_MH <dbl>, O3_MH <dbl>, PM10_MH <dbl>, PM2.5_MH <dbl>,
#> # Pred_SO2 <dbl>, Pred_NOx <dbl>, PredBin_SO2 <dbl>, PredBin_NOx <dbl>,
#> # CodM_SO2 <dbl>, CodM_NOx <dbl>, Mensaje_SO2 <chr>, Mensaje_NOx <chr>
Created on 2019-05-21 by the reprex package (v0.3.0)
I get 28 columns with the expected values.
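As a small illustration of how the end positions fall out of the header, here is the same regex applied to a made-up toy header:
hdr <- "  Ano  Mes Msg"
m <- gregexpr('\\S(?=\\s|$)', hdr, perl = TRUE) # match the last non-space character of each field
unlist(m)
#> [1]  5 10 14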
I'm writing an implementation of the ls command, but I've run into a problem with the columns: I want to align them like the real ls -l does.
drwx------# 3 haxor123 candidate 102 Oct 3 14:43 Applications
drwxr-xr-x 21 haxor123 candidate 714 Nov 29 21:07 Desktop
drwxr-xr-x 4 haxor123 candidate 136 Nov 6 19:54 Documents
drwx------ 9 haxor123 candidate 306 Nov 28 22:28 Downloads
drwxr-xr-x# 396 haxor123 candidate 13464 Nov 29 19:52 Library
drwx------+ 3 haxor123 candidate 102 Aug 9 16:38 Movies
drwx------+ 4 haxor123 candidate 136 Oct 5 14:13 Music
drwxr-xr-x 3 haxor123 candidate 102 Oct 4 23:23 PicineRe
drwxr-xr-x 4 haxor123 candidate 136 Oct 4 23:52 PicineRee
drwxr-xr-x 3 haxor123 candidate 102 Oct 4 22:32 PicineReloaded
drwx------+ 4 haxor123 candidate 136 Nov 11 16:46 Pictures
drwxr-xr-x 6 haxor123 candidate 204 Nov 12 21:38 exam-basedir
lrwxr-xr-x 1 haxor123 candidate 34 Jul 16 10:12 goinfre ->
/Volumes/Storage/goinfre/haxor123/
drwxr-xr-x 4 haxor123 candidate 136 Oct 3 15:14 s
Here is the relevant part of my ls -l function:
temp = list;
ft_putstr("total ");
printblocks(list);
ft_putchar('\n');
while (temp != NULL)
{
    lstat(temp->full_path, &fstat);
    ft_permissions(temp, fstat);
    ft_putstr(" ");
    bytes1 = ft_itoa(fstat.st_nlink);
    ft_putstr(bytes1);
    ft_putstr(get_user(fstat));
    bytes = ft_itoa(fstat.st_size);
    len = ft_strlen(bytes);
    ft_putstr(ft_strjoin(bytes, " "));
    get_time(fstat, temp);
    temp = temp->next;
    if (temp != NULL)
        ft_putchar('\n');
}
If you would like to read the source code for GNU commands directly, you can do so...and it may be a good learning experience:
Where can I find source code for Linux core commands?
In particular, here is ls.c:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c
The only way to know precisely "how ls does it" is to read that file. We are not psychic (at least I am not), so if you're going to ask about any other programming method, your question needs to be self-contained: state specifically what you're trying to achieve and why you can't achieve it.
I have a couple of problems that I'm interested in solving. I would like to select the rows where the conc column exceeds some value and store them in an array, e.g.:
newdata <- data[ which(data$conc > 8), ]
However, I would like to keep the associated datetime stamp with each value. Finally, in another array, whenever the conc value exceeds 8.00 before falling back below 8.00, I would like to store the duration of that episode. So, for example, 21:30 would be recorded as 15 minutes, and another episode would be logged between 00:15 and 03:00, resulting in a stored value of 165 minutes.
datetime conc
20/08/2012 21:00 7.29
20/08/2012 21:15 7.35
20/08/2012 21:30 35.23
20/08/2012 21:45 7.44
20/08/2012 22:00 13.30
20/08/2012 22:15 7.60
20/08/2012 22:30 7.65
20/08/2012 22:45 7.70
20/08/2012 23:00 7.83
20/08/2012 23:15 8.07
20/08/2012 23:30 8.30
20/08/2012 23:45 22.44
21/08/2012 00:00 7.81
21/08/2012 00:15 10.67
21/08/2012 00:30 11.07
21/08/2012 00:45 8.29
21/08/2012 01:00 8.17
21/08/2012 01:15 8.29
21/08/2012 01:30 8.26
21/08/2012 01:45 8.93
21/08/2012 02:00 9.74
21/08/2012 02:15 9.69
21/08/2012 02:30 9.15
21/08/2012 02:45 9.52
21/08/2012 03:00 9.10
21/08/2012 03:15 7.10
One approach would be to add two more columns to your data: one indicating whether conc is above 8, and another accumulating the elapsed time until it drops back below 8.
#generating data
data <- read.table(text="datetime conc
'20/08/2012 21:00' 7.29
'20/08/2012 21:15' 7.35
'20/08/2012 21:30' 35.23
'20/08/2012 21:45' 7.44
'20/08/2012 22:00' 13.30
'20/08/2012 22:15' 7.60
'20/08/2012 22:30' 7.65
'20/08/2012 22:45' 7.70
'20/08/2012 23:00' 7.83
'20/08/2012 23:15' 8.07
'20/08/2012 23:30' 8.30
'20/08/2012 23:45' 22.44
'21/08/2012 00:00' 7.81
'21/08/2012 00:15' 10.67
'21/08/2012 00:30' 11.07
'21/08/2012 00:45' 8.29
'21/08/2012 01:00' 8.17
'21/08/2012 01:15' 8.29
'21/08/2012 01:30' 8.26
'21/08/2012 01:45' 8.93
'21/08/2012 02:00' 9.74
'21/08/2012 02:15' 9.69
'21/08/2012 02:30' 9.15
'21/08/2012 02:45' 9.52
'21/08/2012 03:00' 9.10
'21/08/2012 03:15' 7.10", sep=" ", header=TRUE, stringsAsFactors=FALSE)
#converting to date
data$datetime<-as.POSIXct(data$datetime, format="%d/%m/%Y %H:%M")
#creating stamps
data$stamp <- NA
data$stamp[which(data$conc<8)] <- "less.than.8"
data$stamp[which(data$conc>8)] <- "greater.than.8"
#calculating cumulative duration (in minutes) for episodes where conc > 8
data$cum.duration <- 0 # initialise the column before the loop
for (i in 1:nrow(data)) {
  if (data$stamp[i] == "less.than.8") {
    data$cum.duration[i] <- 0
  }
  if (i > 1 && data$stamp[i] == "greater.than.8") {
    data$cum.duration[i] <- as.numeric(difftime(data$datetime[i], data$datetime[i - 1],
                                                units = "mins")) + data$cum.duration[i - 1]
  }
}
This will result in the following table, then you can do whatever you want with it:
datetime conc stamp cum.duration
1 2012-08-20 21:00:00 7.29 less.than.8 0
2 2012-08-20 21:15:00 7.35 less.than.8 0
3 2012-08-20 21:30:00 35.23 greater.than.8 15
4 2012-08-20 21:45:00 7.44 less.than.8 0
5 2012-08-20 22:00:00 13.30 greater.than.8 15
6 2012-08-20 22:15:00 7.60 less.than.8 0
7 2012-08-20 22:30:00 7.65 less.than.8 0
8 2012-08-20 22:45:00 7.70 less.than.8 0
9 2012-08-20 23:00:00 7.83 less.than.8 0
10 2012-08-20 23:15:00 8.07 greater.than.8 15
11 2012-08-20 23:30:00 8.30 greater.than.8 30
12 2012-08-20 23:45:00 22.44 greater.than.8 45
13 2012-08-21 00:00:00 7.81 less.than.8 0
14 2012-08-21 00:15:00 10.67 greater.than.8 15
15 2012-08-21 00:30:00 11.07 greater.than.8 30
16 2012-08-21 00:45:00 8.29 greater.than.8 45
17 2012-08-21 01:00:00 8.17 greater.than.8 60
18 2012-08-21 01:15:00 8.29 greater.than.8 75
19 2012-08-21 01:30:00 8.26 greater.than.8 90
20 2012-08-21 01:45:00 8.93 greater.than.8 105
21 2012-08-21 02:00:00 9.74 greater.than.8 120
22 2012-08-21 02:15:00 9.69 greater.than.8 135
23 2012-08-21 02:30:00 9.15 greater.than.8 150
24 2012-08-21 02:45:00 9.52 greater.than.8 165
25 2012-08-21 03:00:00 9.10 greater.than.8 180
26 2012-08-21 03:15:00 7.10 less.than.8 0
To select only the rows where each episode ends, you can write:
lines <- which(data$conc > 8)
lines <- lines[c(diff(lines) > 1, TRUE)] # keep rows whose next over-8 row is not adjacent, plus the last one
data[lines, ]
Which will give you:
datetime conc stamp cum.duration
3 2012-08-20 21:30:00 35.23 greater.than.8 15
5 2012-08-20 22:00:00 13.30 greater.than.8 15
12 2012-08-20 23:45:00 22.44 greater.than.8 45
25 2012-08-21 03:00:00 9.10 greater.than.8 180
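As an aside, if the observations are regularly spaced (15 minutes here), the episode durations can also be obtained directly with run-length encoding; a small sketch:
r <- rle(data$conc > 8)                      # runs of consecutive above/below-8 rows
ends <- cumsum(r$lengths)[r$values]          # row index where each over-8 episode ends
data.frame(end = data$datetime[ends],
           duration.mins = r$lengths[r$values] * 15)
This reproduces the four episode ends above with durations 15, 15, 45 and 180 minutes.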