Conditional subsetting of two columns - r - arrays

I have a couple of problems that I'm interested in solving. I would like to subset the conc column by a threshold value and store the result in an array, e.g.:
newdata <- data[ which(data$conc > 8), ]
However, I would also like to save the associated datetime stamp with it. Finally, in another array, for each episode where conc exceeds 8.00 before falling back below 8.00, I would like to store the duration of that episode. So, for example, 21:30 would be recorded as 15 minutes, and another episode would be logged between 00:15 and 03:00, resulting in a stored value of 165 minutes.
datetime conc
20/08/2012 21:00 7.29
20/08/2012 21:15 7.35
20/08/2012 21:30 35.23
20/08/2012 21:45 7.44
20/08/2012 22:00 13.30
20/08/2012 22:15 7.60
20/08/2012 22:30 7.65
20/08/2012 22:45 7.70
20/08/2012 23:00 7.83
20/08/2012 23:15 8.07
20/08/2012 23:30 8.30
20/08/2012 23:45 22.44
21/08/2012 00:00 7.81
21/08/2012 00:15 10.67
21/08/2012 00:30 11.07
21/08/2012 00:45 8.29
21/08/2012 01:00 8.17
21/08/2012 01:15 8.29
21/08/2012 01:30 8.26
21/08/2012 01:45 8.93
21/08/2012 02:00 9.74
21/08/2012 02:15 9.69
21/08/2012 02:30 9.15
21/08/2012 02:45 9.52
21/08/2012 03:00 9.10
21/08/2012 03:15 7.10

One approach would be to add two more columns to your data: one indicating whether conc is above 8, and another accumulating the time elapsed before it returns below 8.
#generating data
data <- read.table(text="datetime conc
'20/08/2012 21:00' 7.29
'20/08/2012 21:15' 7.35
'20/08/2012 21:30' 35.23
'20/08/2012 21:45' 7.44
'20/08/2012 22:00' 13.30
'20/08/2012 22:15' 7.60
'20/08/2012 22:30' 7.65
'20/08/2012 22:45' 7.70
'20/08/2012 23:00' 7.83
'20/08/2012 23:15' 8.07
'20/08/2012 23:30' 8.30
'20/08/2012 23:45' 22.44
'21/08/2012 00:00' 7.81
'21/08/2012 00:15' 10.67
'21/08/2012 00:30' 11.07
'21/08/2012 00:45' 8.29
'21/08/2012 01:00' 8.17
'21/08/2012 01:15' 8.29
'21/08/2012 01:30' 8.26
'21/08/2012 01:45' 8.93
'21/08/2012 02:00' 9.74
'21/08/2012 02:15' 9.69
'21/08/2012 02:30' 9.15
'21/08/2012 02:45' 9.52
'21/08/2012 03:00' 9.10
'21/08/2012 03:15' 7.10", sep=" ", header=TRUE, stringsAsFactors=FALSE)
#converting to date
data$datetime<-as.POSIXct(data$datetime, format="%d/%m/%Y %H:%M")
#creating stamps
data$stamp <- NA
data$stamp[which(data$conc<8)] <- "less.than.8"
data$stamp[which(data$conc>8)] <- "greater.than.8"
#calculating cumulative duration within each episode (sequence of conc > 8)
data$cum.duration <- 0  # rows with conc < 8 keep a duration of 0
for (i in 2:nrow(data)){
  if(data$stamp[i] == "greater.than.8"){
    data$cum.duration[i] <- as.numeric(difftime(data$datetime[i], data$datetime[i-1], units = "mins")) + data$cum.duration[i-1]
  }
}
This results in the following table, which you can then use however you want:
datetime conc stamp cum.duration
1 2012-08-20 21:00:00 7.29 less.than.8 0
2 2012-08-20 21:15:00 7.35 less.than.8 0
3 2012-08-20 21:30:00 35.23 greater.than.8 15
4 2012-08-20 21:45:00 7.44 less.than.8 0
5 2012-08-20 22:00:00 13.30 greater.than.8 15
6 2012-08-20 22:15:00 7.60 less.than.8 0
7 2012-08-20 22:30:00 7.65 less.than.8 0
8 2012-08-20 22:45:00 7.70 less.than.8 0
9 2012-08-20 23:00:00 7.83 less.than.8 0
10 2012-08-20 23:15:00 8.07 greater.than.8 15
11 2012-08-20 23:30:00 8.30 greater.than.8 30
12 2012-08-20 23:45:00 22.44 greater.than.8 45
13 2012-08-21 00:00:00 7.81 less.than.8 0
14 2012-08-21 00:15:00 10.67 greater.than.8 15
15 2012-08-21 00:30:00 11.07 greater.than.8 30
16 2012-08-21 00:45:00 8.29 greater.than.8 45
17 2012-08-21 01:00:00 8.17 greater.than.8 60
18 2012-08-21 01:15:00 8.29 greater.than.8 75
19 2012-08-21 01:30:00 8.26 greater.than.8 90
20 2012-08-21 01:45:00 8.93 greater.than.8 105
21 2012-08-21 02:00:00 9.74 greater.than.8 120
22 2012-08-21 02:15:00 9.69 greater.than.8 135
23 2012-08-21 02:30:00 9.15 greater.than.8 150
24 2012-08-21 02:45:00 9.52 greater.than.8 165
25 2012-08-21 03:00:00 9.10 greater.than.8 180
26 2012-08-21 03:15:00 7.10 less.than.8 0
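The same cum.duration column can also be built without the row-by-row loop. Here is a vectorized sketch that labels each consecutive run of conc > 8 with rle() and accumulates the time step within each run (it assumes the rows are sorted by datetime):
# minutes elapsed since the previous row (0 for the first row)
step <- c(0, as.numeric(difftime(data$datetime[-1], data$datetime[-nrow(data)], units = "mins")))
above <- data$conc > 8
# give every consecutive run of TRUE/FALSE values its own id
r <- rle(above)
run_id <- rep(seq_along(r$lengths), r$lengths)
# cumulative minutes within each run of conc > 8, and 0 elsewhere
data$cum.duration <- ifelse(above, ave(step, run_id, FUN = cumsum), 0)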
To select only the rows where each episode ends, you can write:
lines <- which(data$conc > 8)
lines <- lines[c(diff(lines) > 1, TRUE)]  # keep the last row of each consecutive run of conc > 8
data[lines, ]
Which will give you:
datetime conc stamp cum.duration
3 2012-08-20 21:30:00 35.23 greater.than.8 15
5 2012-08-20 22:00:00 13.30 greater.than.8 15
12 2012-08-20 23:45:00 22.44 greater.than.8 45
25 2012-08-21 03:00:00 9.10 greater.than.8 180
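If you also want the episode durations stored on their own, together with the timestamp at which each episode ends, you can take them straight from these end-of-episode rows, for example:
episodes <- data.frame(end = data$datetime[lines], duration.mins = data$cum.duration[lines])
episodes
which stores 15, 15, 45 and 180 minutes for the four episodes above.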

Related

Google Data Studio: Compare daily sales to 7-day average

I have a data source with daily sales per product.
I want to create a field that calculates the average daily sales for the 7 last days, for each product and day (e.g. on day 10 for product A, it will give me the average sales for product A on days 3 - 9; on Day 15 for product B, I'll see the average sales of B on days 8 - 14).
Is this possible?
Example data (I have the first three columns and need to generate the fourth):
Date Product Sales 7-Day Average
1/11 A 983 201
2/11 A 650 983
3/11 A 328 817
4/11 A 728 654
5/11 A 246 672
6/11 A 613 587
7/11 A 575 591
8/11 A 601 589
9/11 A 462 534
10/11 A 979 508
11/11 A 148 601
12/11 A 238 518
13/11 A 53 517
14/11 A 500 437
15/11 A 684 426
16/11 A 261 438
17/11 A 69 409
18/11 A 159 279
19/11 A 964 281
20/11 A 429 384
21/11 A 731 438
1/11 B 790 471
2/11 B 265 486
3/11 B 94 487
4/11 B 66 490
5/11 B 124 477
6/11 B 555 357
7/11 B 190 375
8/11 B 232 298
9/11 B 747 218
10/11 B 557 287
11/11 B 432 353
12/11 B 526 405
13/11 B 690 463
14/11 B 350 482
15/11 B 512 505
16/11 B 273 545
17/11 B 679 477
18/11 B 164 495
19/11 B 799 456
20/11 B 749 495
21/11 B 391 504
I haven't really tried anything; I couldn't figure out how to get started with this.
This may not be the perfect solution, but it does give your expected result in a crude way.
1. Cross-join the data source with itself first, as shown in the screenshot.
2. Use a calculated field to get the last 7-day average:
(CASE WHEN Date (Table 2) BETWEEN DATETIME_SUB(Date (Table 1), INTERVAL 7 DAY) AND DATETIME_SUB(Date (Table 1), INTERVAL 1 DAY) THEN Sales (Table 2) ELSE 0 END)/7
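If you have the option of preparing the data outside Data Studio, the same trailing 7-day average is easy to precompute and then connect as a data source. A minimal R sketch, assuming a data frame named sales holding the Date, Product and Sales columns from the example (with Date stored as a Date):
library(dplyr)
sales %>%
  group_by(Product) %>%
  mutate(Avg_7day = sapply(seq_along(Date), function(i) {
    # sum of sales over the 7 days before Date[i], divided by 7,
    # mirroring the calculated field above
    sum(Sales[Date >= Date[i] - 7 & Date <= Date[i] - 1]) / 7
  })) %>%
  ungroup()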

How to read a character-string in a column of a data-set

I don't know how to use the read.table command when the data I want to read has some columns containing character strings.
I have a .dat file that contains 28 columns and 100 rows.
Año Mes Día Hora Min SO2 NOx CO O3 PM10 PM2.5 VelV DirV Temp SO2_MH NOx_MH CO_MH O3_MH PM10_MH PM2.5_MH Pred_SO2 Pred_NOx PredBin_SO2 PredBin_NOx CodM_SO2 CodM_NOx Mensaje_SO2 Mensaje_NOx
2018 5 15 16 38 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 15 16 39 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 16 11 29 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma
2018 5 16 11 30 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma
When I try to read the data, the first 26 columns are read correctly, but the 27th and 28th columns come out as "No" and "hay". I want to read the full sentence into the 27th column and do the same for the 28th.
This is what I use
min <- read.table("min.dat",header=T, fill = TRUE)
But I suppose I have to use the quote parameter somehow...
(I use fill=TRUE because some of these character strings are blank.)
You can do this using readr::read_fwf() if you can specify the start and end positions of each column:
library(readr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
fname <- 'sample.txt'
write_file(
' Año Mes Día Hora Min SO2 NOx CO O3 PM10 PM2.5 VelV DirV Temp SO2_MH NOx_MH CO_MH O3_MH PM10_MH PM2.5_MH Pred_SO2 Pred_NOx PredBin_SO2 PredBin_NOx CodM_SO2 CodM_NOx Mensaje_SO2 Mensaje_NOx
2018 5 15 16 38 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 15 16 39 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 99.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 0 0
2018 5 16 11 29 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma
2018 5 16 11 30 4.15 7.51 0.33 77.00 13.00 5.00 1.13 259.00 14.50 4.15 7.51 0.33 77.00 13.00 5.00 4.15 7.51 0.03 0.00 1 1 No hay alarma No hay alarma ',
fname
)
hdr <- read_lines(fname, n_max = 1)
cnames <- hdr %>%
  trimws() %>%
  strsplit('\\s+') %>%
  unlist()
m <- gregexpr('\\S(?=\\s|$)', hdr, perl = TRUE)  # find the end position of each column
epos <- unlist(m)
spos <- lag(epos + 1, 1, default = 1)            # start positions: one past the previous end
read_fwf(fname, fwf_positions(start = spos, end = epos, col_names = cnames), skip = 1)
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> Mensaje_SO2 = col_character(),
#> Mensaje_NOx = col_character()
#> )
#> See spec(...) for full column specifications.
#> # A tibble: 4 x 28
#> Año Mes Día Hora Min SO2 NOx CO O3 PM10 PM2.5 VelV
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018 5 15 16 38 -1 -1 -1 -1 -1 -1 -1
#> 2 2018 5 15 16 39 -1 -1 -1 -1 -1 -1 -1
#> 3 2018 5 16 11 29 4.15 7.51 0.33 77 13 5 1.13
#> 4 2018 5 16 11 30 4.15 7.51 0.33 77 13 5 1.13
#> # … with 16 more variables: DirV <dbl>, Temp <dbl>, SO2_MH <dbl>,
#> # NOx_MH <dbl>, CO_MH <dbl>, O3_MH <dbl>, PM10_MH <dbl>, PM2.5_MH <dbl>,
#> # Pred_SO2 <dbl>, Pred_NOx <dbl>, PredBin_SO2 <dbl>, PredBin_NOx <dbl>,
#> # CodM_SO2 <dbl>, CodM_NOx <dbl>, Mensaje_SO2 <chr>, Mensaje_NOx <chr>
Created on 2019-05-21 by the reprex package (v0.3.0)
I get 28 columns with the expected values
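Applied to your actual file (assuming min.dat starts with the same header line as the sample above), the same steps would be:
library(readr)
library(dplyr)
hdr <- read_lines("min.dat", n_max = 1)
cnames <- unlist(strsplit(trimws(hdr), "\\s+"))
epos <- unlist(gregexpr("\\S(?=\\s|$)", hdr, perl = TRUE))  # end position of each column
spos <- lag(epos + 1, 1, default = 1)                       # start position of each column
min_data <- read_fwf("min.dat", fwf_positions(start = spos, end = epos, col_names = cnames), skip = 1)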

print each range in perl array

I have an array of ranges in Perl and need a way to loop through each range in the array, search for a number, and print the min..max bounds of the range where it is found. I am able to do this in Bash shell scripting but am having some trouble in Perl.
My code:
#!/usr/bin/perl
use List::Util qw(max min);
$search_num = 95;
@ranges = (73..80, 92..107, 941..1000, 3000..3170);
foreach $num (@ranges) {
    $range_min = min(@ranges);
    $range_max = max(@ranges);
    if ($search_num == $n) {
        print "$search was found in range $range_min..$range_max\n";
    }
}
Desired output:
95 was found in range 92..107
The following works fine for a single hard-coded range, but I need a way to keep a series of ranges in an array so I can loop over them, search, and display where the number was found. This works:
@range = (92..107);
foreach $num (@range) {
    $range_min = min(@range);
    $range_max = max(@range);
    if ($search_num == $num){
        print "$search_num was found in range $range_min..$range_max\n";
    }
}
Output:
95 was found in range 92..107
Thanks for any advice.
@ranges = (73..80, 92..107, 941..1000, 3000..3170);
You seem to be under the impression that this will put separate range objects in @ranges. Instead, @ranges contains the following flat list:
$ perl -E '@ranges = (73..80, 92..107, 941..1000, 3000..3170); say "@ranges"'
73 74 75 76 77 78 79 80 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170
You can insert references to anonymous arrays in @ranges:
@ranges = ([73..80], [92..107], [941..1000], [3000..3170]);
However, since you already know the upper and lower limits of each range, why are you wasting memory?
@ranges = ([73, 80], [92, 107], [941, 1000], [3000, 3170]);
Here is one way to implement that:
#!/usr/bin/env perl
use strict;
use warnings;

my @ranges = ([73, 80], [92, 107], [941, 1000], [3000, 3170]);
my $search = 95;
my $found  = search_in_ranges($search, \@ranges);

for my $r (@$found) {
    printf "%d was found in [%d, %d]\n", $search, $r->[0], $r->[1];
}

sub search_in_ranges {
    my ($n, $ranges) = @_;
    return [ grep { $n >= $_->[0] && $n <= $_->[1] } @$ranges ];
}
See also perldoc perlreftut, which is installed along with your Perl distribution.

Spotfire Line chart with min max bars

I am trying to make a chart that has a line graph showing the change in the Count column for each month, and two points showing the min and max value in that month. The table is below.
Date Min Max Count
1/1/2015 0.28 6.02 13
2/1/2015 0.2 7.72 8
3/1/2015 1 1 1
4/1/2015 0.4 6.87 7
5/1/2015 0.36 3.05 8
6/1/2015 0.17 1.26 13
7/1/2015 0.31 1.59 15
8/1/2015 0.39 3.35 13
9/1/2015 0.22 0.86 10
10/1/2015 0.3 2.48 13
11/1/2015 0.16 0.82 9
12/1/2015 0.33 2.18 5
1/1/2016 0.23 1.16 14
2/1/2016 0.38 1.74 7
3/1/2016 0.1 8.87 9
4/1/2016 0.28 0.68 3
5/1/2016 0.13 3.23 11
6/1/2016 0.33 1 5
7/1/2016 0.28 1.26 4
8/1/2016 0.08 0.41 2
9/1/2016 0.43 0.61 2
10/1/2016 0.49 1.39 4
11/1/2016 0.89 0.89 1
I tried a scatter plot, but when I try to add a Line from Column value I get an error saying that the line cannot work on categorical data.
Any suggestions on how I can prepare this visualization?
Thanks!
I would do this in a combination chart:
1. Insert a combination chart (Line & Bar Graph).
2. On your X-axis, put your date as <BinByDateTime([Date],"Year.Month",1)>.
3. On your Y-axis, put your aggregations: Sum([Count]), Max([Max]), Min([Min]).
4. Right-click > Properties > Series > set the Min and Max series to the Line type.
5. (Optional) Change the Y-axis scale.

Average of Counts

I have a table called totals, and the data looks like:
ACC_ID Data_ID Mon Weeks Total_AR_Count Total_FR_Count Total_OP_Count
23 9 01/2011 4 172 251 194
42 9 01/2011 4 2 16 28
75 9 01/2011 4 33 316 346
75 9 07/2011 5 1 12 20
42 9 09/2011 5 25 758 25
I want the output to be the average of all the counts, grouped by ACC_ID and Data_ID:
ACC_ID Data_ID Avg_AR_Count Avg_FR_Count Avg_OP_Count
23 9 172 251 194
42 9 13.5 387 26.5
75 9 17 164 183
How can I do this?
Your description of what you want just about writes the SQL:
SELECT ACC_ID, Data_ID,
       AVG(Total_AR_Count) AS Avg_AR_Count,
       AVG(Total_FR_Count) AS Avg_FR_Count,
       AVG(Total_OP_Count) AS Avg_OP_Count
FROM totals
GROUP BY ACC_ID, Data_ID
