Merging multiple data frames while keeping them sorted by date

In R, I have 4 data frames with different dates and PL values:
head(array1) gives:
Dates P&L
1 2014-10-01 900
2 2014-10-02 -3185
3 2014-10-03 3800
4 2014-10-07 -2300
5 2014-10-08 2100
6 2014-10-09 2400
head(array2) gives:
Dates P&L
1 2015-03-02 -6962.5
2 2015-03-03 -14237.5
3 2015-03-04 7862.5
4 2015-03-05 925.0
5 2015-03-09 -3725.0
6 2015-03-10 262.5
head(array3) gives:
Dates P&L
1 2014-10-08 7160
2 2014-10-09 7600
3 2014-10-10 2260
4 2014-10-13 4820
5 2014-10-15 -1500
6 2014-11-06 3030
head(array4) gives:
Dates P&L
1 2015-02-24 1245
2 2015-03-06 10650
3 2015-03-10 -200
4 2015-04-17 -9690
5 2015-05-15 -28740
6 2015-05-26 3970
I would like to combine all these data frames into a single one, keeping the rows sorted by date and summing the values when a date appears more than once. Can someone please help me? Joe

One option is to rbind all the data frames into a single data frame, then aggregate the values against Dates:
agg <- aggregate(`P&L` ~ Dates, rbind(array1, array2, array3, array4), FUN = sum)
agg[order(as.Date(agg$Dates)),]
# Dates P&L
#1 2014-10-01 900.0
#2 2014-10-02 -3185.0
#3 2014-10-03 3800.0
#4 2014-10-07 -2300.0
#5 2014-10-08 9260.0
#6 2014-10-09 10000.0
# ...
Or put the four data frames in a list and use do.call(rbind, ...) to bind them together:
lst <- list(array1, array2, array3, array4)
agg <- aggregate(`P&L` ~ Dates, do.call(rbind, lst), FUN = sum)
agg[order(as.Date(agg$Dates)),]

Related

Drop columns from a data frame but I keep getting this error below

No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, you should put your code inside [ on the right side of the comma, not on the left side.
So [, your_code] not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
Edit to Reproduce the Error
The error message you got indicates that there is a column with a single, identical value in every row.
To show this, let's fit a logistic regression on a subset of the mtcars data in which the cyl column takes only one value, and then use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now compare it with the same logistic regression on the full mtcars data, where the cyl column takes several values.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped the three columns that hold a single value in every row, another column in Trainingmodel1 also holds a single value. That probably happened when the data frame was filtered and split into training and test groups. It is worth checking with summary(Trainingmodel1).
Further edit
I have checked the summary(Trainingmodel1) result, and it is clear that EmployeeNumber has a single value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model or, if EmployeeNumber has other levels in the full data and you want to keep it in the model, make sure it contains at least two levels in the training data. One way to achieve that during splitting is to repeat the random sampling until the selected rows contain at least two levels of EmployeeNumber, using a for, while, or repeat loop. It is possible, but I don't know how appropriate repeated sampling is for your study.
As for your question about subsetting more than one variable, you can use subset and conditionals. For example, you want to get a subset of mtcars that has cyl == 4 and mpg > 20 :
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl <8) | (mpg > 20 & gear > 4 ))

Issues Regarding SAS

I was working on a homework problem regarding using arrays and looping to create a new variable to identify the date of when the maximum blood lead value was obtained but got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled so that the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject; it could be the first, second or third. In addition, some of the months and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to be set equal to 15, any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tips regarding this issue? What did I do wrong? Was I supposed to create an additional array for the date on which the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from
Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte itself. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax...). Instead, choose an element of the array and assign the value to that element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.
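Since step 4 of the homework asks for the date of the largest value rather than its position, one reading of the fix is to keep a single scalar and copy the matching date into it when the loop finds the maximum. Below is a minimal sketch of that logic in Python (not SAS), only to illustrate the index matching. The names sample_dates, lead_levels, max_level and max_date are hypothetical, and the values are those of subject 1 in the listing above with the missing day already recoded to 15.

```python
from datetime import date

# Hypothetical stand-ins for the SAS arrays date{3} and bls{3}, subject 1.
# The second sample had day = -9, recoded to 15 per the investigators' rule.
sample_dates = [date(1990, 10, 6), date(1990, 7, 15), date(1990, 1, 14)]
lead_levels = [1.62, 1.35, 1.47]

# Equivalent of maxbls = max(of bls1-bls3)
max_level = max(lead_levels)

# Walk both lists with the same index; when the reading equals the maximum,
# copy the date at that index into a single scalar, not into an array.
max_date = None
for i in range(len(lead_levels)):
    if lead_levels[i] == max_level:
        max_date = sample_dates[i]
        break
```

The point of the sketch is that the result is one value per child, so it does not need its own three-element array.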
You are not handling the ID field on lines 2-4 properly:
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example, you need to skip the first field on lines 2-4, or read the IDs into separate variables, perhaps to check that they are all the same.
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example shows how to check that you have 4 lines with the same ID; if you do, it reads the rest of the variables, and otherwise it executes LOSTCARD. In the sample data below, ID 3 has a missing record.
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;

How to find the max,min value for specific season per year of a column in an array in R?

I have the following array which I called station:
A1 <- matrix(runif(120),24,5)
A1[1:12,1]<-2012
A1[13:24,1]<-2013
A1[1:12,2]<-(1:12)
A1[13:24,2]<-(1:12)
A1[1:12,3]<-seq(1,24,by=2)
A1[13:24,3]<-seq(1,24,by=2)
A2 <- matrix(runif(120),24,5)
A2[1:12,1]<-2012
A2[13:24,1]<-2013
A2[1:12,2]<-(1:12)
A2[13:24,2]<-(1:12)
A2[1:12,3]<-seq(1,24,by=2)
A2[13:24,3]<-seq(1,24,by=2)
station <- array(NA,c(24,5,2))
station[,,1] <- A1
station[,,2] <- A2
dimnames(station)[[2]]<-c('year','month','day','win_3','win_7')
dimnames(station)[[3]]<-c('station1','station2')
print(station)
I would like to extract the maximum value of win_3, which I call Max_3Days, over the spring season (months 3, 4 and 5) of each year for each station, together with the corresponding day and month (3, 4 or 5).
Likewise, I want to extract the minimum value of win_7, which I call Min_7Days, over the summer season (months 6, 7 and 8) of each year for each station, together with the corresponding month (6, 7 or 8) and day.
I would like to keep the result in array format if possible.
The result should be like this:
, , 1
Year Month day Max_3Days Year Month Day Min_7Days
[1,] 2012 3 15 2800 2012 6 1 400
[2,] 2013 4 2 2730 2013 6 4 100
, , 2
Year Month day Max_3Days Year Month Day Min_7Days
[1,] 2012 4 15 2800 2012 7 10 250
[2,] 2013 5 2 2750 2013 7 14 271
I was able to select the spring and summer seasons and find the max and min values when I had only one station as a data frame. I now want to do this for about 70 stations stored as matrices in an array, and I want to keep the result in an array:
In case of data frame (only one station):
Summer<-station[which(station$month>"5"&station$month<"9"),]
Minima<-ddply(Summer, ~ year, summarise, month=month[which.min(win_7)],day=day[which.min(win_7)], Min_7Days =min(win_7, na.rm = TRUE))
Spring<-station[which(station$month>"2"&station$month<"6"),]
Maxima<-ddply(Spring, ~ year, summarise, month=month[which.max(win_3)],day=day[which.max(win_3)], Max_3Days =max(win_3, na.rm = TRUE))
Any suggestion would be appreciated!!
For now, I have converted the stations into a list of data frames and worked from there.
l = vector('list', 2)
l[[1]] = data.frame(station[,,1])
l[[2]] = data.frame(station[,,2])
spring_end <- 5
spring_start <- 3
summer_end <- 8
summer_start <- 6
library(dplyr)
func <- function(df){
df %>% group_by(year) %>%
summarise( Max_3Days = max(win_3[between(month, spring_start, spring_end)]),
Month_spring = month[between(month, spring_start, spring_end)][which.max(win_3[between(month, spring_start, spring_end)])],
Min_7Days = min(win_7[between(month, summer_start, summer_end)]),
Month_summer = month[between(month, summer_start, summer_end)][which.min(win_7[between(month, summer_start, summer_end)])])
}
lapply(l, func)
#[[1]]
# year Max_3Days Month_spring Min_7Days Month_summer
#1 2012 0.6521762 5 0.3547476 6
#2 2013 0.9627131 3 0.1754293 6
#[[2]]
# year Max_3Days Month_spring Min_7Days Month_summer
#1 2012 0.6115331 5 0.08505264 6
#2 2013 0.6051239 3 0.10938192 8

Loop through two pandas dataframes

I have two dataframes df1 and df2 as shown below:
df1:
Month Count
6 314
6 418
6 123
7 432
df2:
Month ExpectedValue
6 324
7 512
8 333
I have to loop through df1 and df2. If df1['Month'] == 6, then I have to loop through df2 to get the expected value for month 6. Then, I will have the field in df1 as df1['ExpectedValue'].
Output like this below:
df1:
Month Count ExpectedValue
6 314 324
6 418 324
6 123 324
7 432 512
Is looping through 2 dataframes an efficient idea? Any help would be appreciated.
In general, you shouldn't loop over DataFrames unless it's absolutely necessary. You'll usually get better performance using a built-in Pandas function that's already been optimized, or by using a vectorized approach. This will usually result in cleaner code too.
In this case you can use DataFrame.merge:
df1 = df1.merge(df2, how='left', on='Month')
The resulting output:
Month Count ExpectedValue
0 6 314 324
1 6 418 324
2 6 123 324
3 7 432 512

Unpivoting Data in SSIS

I am attempting to normalize data using SSIS in the following format:
SerialNumber Date R01 R02 R03 R04
-------------------------------------------
1 9/25/2011 9 6 1 2
1 9/26/2011 4 1 3 5
2 9/25/2011 7 3 2 1
2 9/26/2011 2 4 10 6
Each "R" column represents a reading for an hour. R01 is 12:00 AM, R02 is 1:00 AM, R03 is 2:00 AM and R04 is 3:00 AM. I would like to transform the data and store it in another table in this format (line breaks for readability):
SerialNumber Date Reading
-----------------------------------------
1 9/25/2011 12:00 AM 9
1 9/25/2011 1:00 AM 6
1 9/25/2011 2:00 AM 1
1 9/25/2011 3:00 AM 2
1 9/26/2011 12:00 AM 4
1 9/26/2011 1:00 AM 1
1 9/26/2011 2:00 AM 3
1 9/26/2011 3:00 AM 5
2 9/25/2011 12:00 AM 7
2 9/25/2011 1:00 AM 3
2 9/25/2011 2:00 AM 2
2 9/25/2011 3:00 AM 1
2 9/26/2011 12:00 AM 2
2 9/26/2011 1:00 AM 4
2 9/26/2011 2:00 AM 10
2 9/26/2011 3:00 AM 6
I am using the unpivot transformation in an SSIS 2008 package to accomplish most of this but the issue I am having is adding the hour to the date based on the column of the value I am working with. Is there a way to accomplish this in SSIS? Keep in mind that this is a small subset of data of around 30 million records so performance is an issue.
Thanks for the help.
Create an SSIS package, add a new Data Flow Task, and configure this DFT (Edit...)
Add a new data source
Add UNPIVOT component and configure it thus:
Add DATA CONVERSION component:
Temporary results:
Add DERIVED COLUMN component:
For the NewData derived column you can use this expression: DATEADD("HOUR",(Type == "R01" ? 0 : (Type == "R02" ? 1 : (Type == "R03" ? 2 : 3))),Date). The «boolean_expression» ? «when_true» : «when_false» operator works like the IIF() function (from VBA/VB) and is used here to calculate the number of hours to add: 0 hours for "R01", 1 hour for "R02", 2 hours for "R03", and otherwise 3 hours (for "R04").
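The nested conditional encodes a simple rule: the hour offset is the number in the column name minus one. The sketch below shows that arithmetic in Python (not an SSIS expression), assuming the column names always follow the pattern "R" plus a two-digit number; the function name reading_timestamp is hypothetical.

```python
from datetime import datetime, timedelta

def reading_timestamp(day: datetime, column_name: str) -> datetime:
    """Return the day plus the hour offset encoded in a name like 'R03'."""
    # 'R01' -> 0 hours, 'R02' -> 1 hour, and so on
    hour_offset = int(column_name[1:]) - 1
    return day + timedelta(hours=hour_offset)

base = datetime(2011, 9, 25)
stamps = [reading_timestamp(base, name) for name in ("R01", "R02", "R03", "R04")]
for stamp in stamps:
    print(stamp)
```

Deriving the offset from the name instead of listing each case would likely scale better if the real table has more hourly columns than the four shown in the question.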
Results:
