This is two questions in one; if I should be splitting them, please let me know.
I have a spreadsheet of HR data and I'm going to be cutting it into various cross sections. Each row currently represents an employee in a given report year (so over a three-year period an employee appears three times, and a column records which year that row refers to), along with a series of other characteristics. I've also added a field showing how many FTEs that employee represents for that period, which represents that employee's exposure to risk.
What I'm trying to do, for the sake of marrying it up with other data, is create an n-dimensional array where each point represents the total exposure to risk matching those dimensions. In the example I'm using, the dimensions are Year, Company (there are a couple), Age Band, Gender, Division, and Tenure band.
To do so, among other code, I've written the following:
FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
FactorDim <- lapply(length,mapply(unique,HR[FactorNames]))
Names <- lapply(HR[FactorNames], function(x)sort(unique(x)))
Index <- 1
for (Ten in 1:FactorDim[6]) {
  for (Job in 1:FactorDim[5]) {
    for (Sex in 1:FactorDim[4]) {
      for (Age in 1:FactorDim[3]) {
        for (Co in 1:FactorDim[2]) {
          for (Year in 1:FactorDim[1]) {
            ExpList[Index] = sum(subset(HR,
                                        HR$FY == Names[1, Year],
                                        HR$`HR Business` == Names[2, Co],
                                        HR$`Age Band` == Names[3, Age],
                                        HR$Gender == Names[4, Sex],
                                        HR$`Classification Level 1` == Names[5, Job],
                                        HR$`Tenure Band` == Names[6, Ten],
                                        select = Exposure),
                                 na.rm = TRUE)
            Index <- Index + 1
          }
        }
      }
    }
  }
}
There are two main issues.
Names <- lapply(HR[FactorNames], function(x) sort(unique(x))) isn't giving me what I need: the unique values come back bundled inside list elements rather than in a form I can index like a matrix, so the lookups inside my for loops throw the error Error in Names[1, Year] : incorrect number of dimensions.
My nested for loops are surely nowhere near the optimal way to fill this array, and I was wondering if anyone knew what would be.
What would you recommend?
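For what it's worth, I suspect the lookup needs list-style indexing, something like the sketch below (using the Names object from above), though that still leaves the question of how best to fill the array:
# Names is a list, so a single level is reached with [[i]][j], e.g.
Names[[1]][Year]   # the Year-th sorted unique value of FY
# rather than the matrix-style Names[1, Year] used in the loops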
I made up some data
# make fake data
FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
d <- as.data.frame(lapply(FactorNames,function(x){paste(x,sample(1:3,6,replace=T))}))
names(d) <- FactorNames
d$Name <- c('z','y','x','w','v','z')
d$Exposure <- randu[1:6,1]
From what I understand, your for loops intend to generate something like the d$sum_val column below: a sum of all Exposure values for each combination of Name and the factors.
# get sum
library(dplyr) # %>% pipe, group_by, and summarize
d %>%
  group_by(Name, FY, `HR Business`, `Age Band`, Gender, `Classification Level 1`, `Tenure Band`) %>%
  summarize(sum_val = sum(Exposure))
To make an n-dimensional array instead, look to reshape2::acast with a formula like factor1 ~ factor2 ~ factor3, with a ~ separating each dimension.
# lazy way to write out each of the factors
quoteFN <- lapply(c('Name',FactorNames),sprintf,fmt='`%s`')
concatFN <- paste(collapse=" ~ ", quoteFN )
# collapse into array
out <- reshape2::acast(d, as.formula(concatFN),value.var='Exposure',sum)
# what does it look like
dimnames(out)
dim(out)
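One quick sanity check, assuming acast's default fill of 0 when sum is the aggregating function: every row of d contributes to exactly one cell, so the totals should agree.
# each Exposure value lands in exactly one cell, so the totals should match
all.equal(sum(out), sum(d$Exposure))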
I'm trying to use MatchIt to create two sets of matched investment companies (treatment vs control).
I need to match the treatment companies to the control companies using only data from the 1-3 years preceding the treatment.
For example if a company received treatment in 2009, then I would want to match it using data from 2009, 2008, 2007 (My after treatment effects dummy would hold a value from 2010 onwards in this case)
I am unsure how to add this selection into my matching code, which currently looks like this:
matchit(signatory ~ totalUSD + brownUSD + country + strategy, data = panel6, method = "full")
Should I consider using the 'after treatment effects' dummy in some way?
Any tips for how I add this in would be greatly appreciated!
There is no straightforward way to do this in MatchIt. You can set a caliper, which requires the control companies to be within a certain number of years from a treated company, but there isn't a way to require that control companies have a year strictly before the treated company. You can perform exact matching on year so that the treated and control companies have exactly the same year using the exact argument.
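A sketch of that exact-matching option, assuming a recent MatchIt version (where exact accepts a one-sided formula) and that the year variable in panel6 is called year, as in the code further down:
m_exact <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
                   data = panel6, method = "full", exact = ~ year)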
Another, slightly more involved, way is to construct a distance matrix yourself and set to Inf any distances between units that are forbidden to match with each other. The first step is to estimate a propensity score, which you can do manually or using matchit(). Then you construct a distance matrix and, for each entry, decide whether to set the distance to Inf. Finally, you supply the distance matrix to the distance argument of matchit(). Here's how you would do that:
#Estimate the propensity score
ps <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
              data = panel6, method = NULL)$distance
#Create the distance matrix
dist <- optmatch::match_on(signatory ~ ps, data = panel6)
#Loop through the matrix and set disallowed matches to Inf
t <- which(panel6$signatory == 1)
u <- which(panel6$signatory != 1)
for (i in seq_along(t)) {
  for (j in seq_along(u)) {
    if (panel6$year[u[j]] > panel6$year[t[i]] || panel6$year[u[j]] < panel6$year[t[i]] - 2)
      dist[i, j] <- Inf
  }
}
#Note: can be vectorized for speed (see the sketch below) but shouldn't take long regardless
#Supply the distance matrix to matchit() and match
m <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
             data = panel6, method = "full", distance = dist)
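As the comment notes, the loop can be vectorized; one possible sketch, assuming dist subsets like an ordinary matrix here (rows correspond to treated units, columns to control units):
forbidden <- outer(panel6$year[t], panel6$year[u],
                   function(yt, yu) yu > yt | yu < yt - 2)
dist[forbidden] <- Inf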
That should work. You can verify by looking at individual groups of matched companies using match.data():
md <- match.data(m, data = panel6)
md <- md[with(md, order(subclass, signatory)),]
View(md) #assuming you're using RStudio
You should see that within subclasses, the control units are 0-2 years below the treated units.
I have measurements of maximum and minimum temperature and precipitation that are organized as arrays of size 100 x 96 x 50769, where the first two dimensions (i and j) are grid cells with associated coordinates and the third (z) indexes the measurements over time.
Conceptually, each (i, j) grid cell holds a time series of z measurements.
I am using the climdex.pcic package to calculate indices of extreme weather events. Given a time series of maximum and minimum temperature and precipitation, the climdexInput.raw function will return a climdexIput object that can be used to determine several indices: number of frost days, number of summer days, consecutive dry days etc.
The call for the function is pretty simple:
ci <- climdexInput.raw(tmax=x, tmin=y, prec=z,
t, t, t, base.range=c(1961,1990))
where x is a vector of maximum temperatures, y is a vector of minimum temperatures, z is a vector of precipitation and t is a vector with dates under which x, y and z were measured.
What I would like to do is extract the time series for each element of my array (i.e. each grid cell) and use it to run the climdexInput.raw function.
Because of the large number of elements of real data, I want to run this task in parallel on my 4-core Linux server. However, I have no experience with parallelization in R.
Here's one example of my program (with intentionally reduced dimensions to make execution faster on your computer):
library(climdex.pcic)
# Create some dates
t <- seq(as.Date('2000-01-01'), as.Date('2010-12-31'), 'day')
# Parse the dates into PCICt
t <- as.PCICt(strftime(t), cal='gregorian')
# Create some dummy weather data, with dimensions `# of lat`, `# of lon` and `# of timesteps`
nc.min <- array(runif(10*9*4018, min=0, max=15), c(10, 9, 4018))
nc.max <- array(runif(10*9*4018, min=25, max=40), c(10, 9, 4018))
nc.prc <- array(runif(10*9*4018, min=0, max=25), c(10, 9, 4018))
# Create "ci" object
ci <- climdexInput.raw(tmax=nc.max[1,1,], tmin=nc.min[1,1,], prec=nc.prc[1,1,],
t, t, t, base.range=c(2000,2005))
# Once you have “ci”, you can compute any of the indices provided by the climdex.pcic package.
# The example below is for the consecutive dry days (CDD) index per year:
cdd <- climdex.cdd(ci, spells.can.span.years = TRUE)
Now, please note that in the example above I used only the first element of my array ([1,1,]) as an example in the climdexInput.raw function.
How can I do the same for all elements, taking advantage of parallel processing, possibly by looping over the dimensions i and j of my array?
You can use foreach to do that:
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
res <- foreach(j = seq_len(ncol(nc.min))) %:%
  foreach(i = seq_len(nrow(nc.min))) %dopar% {
    ci <- climdex.pcic::climdexInput.raw(
      tmax = nc.max[i, j, ],
      tmin = nc.min[i, j, ],
      prec = nc.prc[i, j, ],
      t, t, t,
      base.range = c(2000, 2005)
    )
  }
stopCluster(cl)
See my guide on parallelism using foreach: https://privefl.github.io/blog/a-guide-to-parallelism-in-r/.
Then, to compute an index, just use climdex.cdd(res[[1]][[1]], spells.can.span.years = TRUE) (j first, i second).
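If you then want an index for every grid cell rather than just [1,1,], one possible sketch that reuses the nested list structure res[[j]][[i]] from above:
cdd_all <- lapply(res, function(col)
  lapply(col, climdex.cdd, spells.can.span.years = TRUE))
# cdd_all[[j]][[i]] holds the CDD series for grid cell (i, j)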
I have a 4-dimensional array (an ensemble of climate models with dimensions lon, lat, time, models). I want to iterate over models (4th dim) and apply a function to each decade/10-year chunk (a sequence over the 3rd dim). The entire length of the time series in my real data is 27 years, so I needed to introduce the "if" statement for the last decade, which does not contain a complete 10 years. However, when I try to manually verify the results, I get slightly higher numbers. This does not happen when I remove the if statement or use a perfect 20-year time series. It seems to me that the "if" statement somehow also changes the indices and counters for the full decades. Can someone please shed some light on what is happening? I hope the corrected code may be useful for many people working on multidimensional arrays in general. I prepared simplified code, which also generates the data and reproduces this behavior:
array1<- array(1:120, dim=c(3,3,17))
array2<- array(500:620, dim=c(3,3,17))
array3<- array(1000:1120, dim=c(3,3,17))
# create an empty array
nlon<-3
nlat<-3
nt<-17
last_decade_length <- 7
play_array<-matrix(0,nlon*nlat*nt*3)
dim(play_array)<-c(nlon,nlat,nt,3)
#Allocate DATA of array1 (model one)
play_array[,,,1]<-array1
#Allocate DATA of array2 (model two)
play_array[,,,2]<-array2
#Allocate DATA of array3 (model three)
play_array[,,,3]<-array3
# create an object to hold final result
mean_decade_object<-matrix(0,2*3)
dim(mean_decade_object)<-c(2,3)
### NESTED DOUBLE LOOP - "a" is the index to iterate over decades; "b" is the index to iterate over arrays/models
a<-1
b<-1
for (array in 1:dim(play_array)[4]) {
  print(paste("array", array))
  for (decade in seq(1, dim(play_array)[3], 10)) {
    if (length(decade) == 10) {
      mean_decade_object[a, b] <- mean(play_array[, , decade:(decade + 9), array], na.rm = T)
      print(mean(play_array[, , decade:(decade + 9), array], na.rm = T))
      print(paste("decade", decade))
      a <- a + 1
    }
    else {
      mean_decade_object[a, b] <- mean(play_array[, , decade:(decade + last_decade_length - 1), array], na.rm = T)
      print(mean(play_array[, , decade:(decade + last_decade_length - 1), array], na.rm = T))
      print(paste("decade", decade))
      a <- a + 1
    }
  }
  a <- 1
  b <- b + 1
}
print(mean_decade_object)
##############
#VERIFICATION#
##############
sub1 <- play_array[,,,1]
mean(sub1[,,1:10], na.rm=T)
Why is the result of the verification (the mean of the 1st decade of the 1st model) not matching the top-left cell of the object produced by the loop? Any ideas?
I tried to fix your existing code, to the best of my abilities, without changing too much.
The main issue in the code is that the condition in the if statement will never be met. You are looping over seq(1, dim(play_array)[3], 10), which in this case is 1, 11, meaning that length(decade) is always 1, so the code jumps to the else branch every time and calculates the mean over seven years. You can check this by changing the number of years in your verification code to 7:
sub1 <- play_array[,,,1]
mean(sub1[,,1:7], na.rm=T)
and here's the corrected code, though I believe a more efficient, or rather more elegant, method can be found to achieve the same goal.
mean_decade_object <- matrix(0, ncol = 3, nrow = 2)
for (array in 1:dim(play_array)[4]) {
  for (decade in 1:floor(dim(play_array)[3] / 10)) {
    decade_start <- decade * 10 - 9
    decade_end <- decade * 10
    mean_decade_object[decade, array] <- mean(play_array[, , decade_start:decade_end, array],
                                              na.rm = T)
    print(mean_decade_object[decade, array])
    print(paste("decade", decade))
  }
  if (dim(play_array)[3] %% 10 != 0) {
    # dim(play_array)[3] %% 10 ----> the remaining years
    decade_start <- dim(play_array)[3] - (dim(play_array)[3] %% 10) + 1
    decade_end <- dim(play_array)[3]
    mean_decade_object[decade + 1, array] <- mean(play_array[, , decade_start:decade_end, array],
                                                  na.rm = T)
  }
}
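For what it's worth, a possibly more concise variant of the same idea (just a sketch, grouping the time steps into decade-sized chunks with tapply):
decade_id <- ceiling(seq_len(dim(play_array)[3]) / 10)   # 1,...,1,2,...,2
mean_decade_object2 <- sapply(1:dim(play_array)[4], function(m)
  tapply(seq_along(decade_id), decade_id, function(idx)
    mean(play_array[, , idx, m], na.rm = TRUE)))
mean_decade_object2   # 2 x 3, same layout as mean_decade_object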
I convert between data formats a lot. I'm sure this is quite common. In particular, I switch between arrays and lists. I'm trying to figure out if I'm doing it right, or if I'm missing any schemas that would greatly improve quality of life. Below I'll give some examples of how to achieve desired results in a couple situations.
Begin with the following array:
dat <- array(1:60, c(5,4,3))
Then, convert one or more of the dimensions of that array to a list. For clarification and current approaches, see the following:
1 dimension, array to list
# Convert 1st dim
dat_list1 <- unlist(apply(dat, 1, list),F,F) # this is what I usually do
# Convert 1st dim, (alternative approach)
library(plyr) # I don't use this approach often b/c I try to go base if I can
dat_list1a <- alply(dat, 1) # points for being concise!
# minus points to alply for being slow (in this case)
> microbenchmark(unlist(apply(dat, 1, list),F,F), alply(dat, 1))
Unit: microseconds
expr min lq mean median uq max neval
unlist(apply(dat, 1, list), F, F) 40.515 43.519 50.6531 50.4925 53.113 88.412 100
alply(dat, 1) 1479.418 1511.823 1684.5598 1595.4405 1842.693 2605.351 100
1 dimension, list to array
# Convert elements of list into new array dimension
# bonus points for converting to original array
dat_array1_0 <- simplify2array(dat_list1)
aperm.key1 <- sapply(dim(dat), function(x)which(dim(dat_array1_0)==x))
dat_array1 <- aperm(dat_array1_0,aperm.key1)
In general, these are the tasks I'm trying to accomplish, although sometimes it's in multiple dimensions or the lists are nested, or some such other complication. So I'm asking if anyone has a "better" (concise, efficient) way of doing either of these things, but bonus points if a suggested approach can handle other related scenarios too.
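One base alternative I have been looking at (assuming R >= 3.6, where asplit() exists) is the following; it should round-trip back to the original array in this example:
# array to list along dim 1
dat_list1b <- asplit(dat, 1)                        # list of 4x3 slices
# and back, re-permuting so the split dimension comes first again
dat_array1b <- aperm(simplify2array(dat_list1b), c(3, 1, 2))
identical(dat, dat_array1b)                         # should be TRUE here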
I imagine I am missing something quite simple here, or I am barking up the wrong tree completely; however, I have been trying to sort this out over a number of days and my novice R skills haven't been able to crack it.
I am looking for a method to reference an array of values from within an R function. I am creating a simulated population: I have individuals' age, sex and ethnicity, and I want to simulate the presence or absence of diabetes. I have the prevalence of diabetes by age bracket, gender and ethnicity, which I have made into a 2 (gender) x 11 (age bracket) x 6 (ethnicity) array. What I want to do is reference the correct cell within the array and use that, with a runif call, to run a Bernoulli trial per individual.
The code below is the current version however I have tried a number of different methods with varying results:
diabgen <- function(AB, sex, eth) {
  AB <- AB
  sex <- sex
  eth <- as.numeric(eth)
  #make matrix reference
  #make 'european' equal to 'other'
  eth <- ifelse(eth == 7, 6, eth)
  #change male from a 0 coding to a 2 for array lookup
  sex <- ifelse(sex == 1, 1, 2)
  #remove seven from AB due to diab data starting at 30-34 age bracket
  agebracket <- AB - 7
  #random number drawn
  diabbase <- runif(census$Total.Sex[AB], 0, 1)
  #census$Total.Sex gives the total number in each age bracket
  #array assignment
  arrayvalue <- Darray[agebracket, sex, eth]
  diab <- ifelse((diabbase >= (Darray[agebracket, sex, eth])), 1, 0)
  return(diab)
}
If I call the function from the command line with arrayvalue returned rather than diab, and individual values submitted rather than variables (i.e. diabtest <- diabgen(10,1,1)), it returns the correct value from the array, but if I submit the variables (i.e. diabtest <- diabgen(AB,sex,eth)) it returns an empty array.
If I can give further info that might make what I am talking about clearer, please let me know; I would be more than happy to do so. It seems so easy, but it is doing my head in. I am open to any suggestions on other/better ways of doing the same thing; any hints appreciated.
This maybe doesn't solve your problem (I'll update as needed), but here is a simple simulated data frame for your conditions (2 x 11 x 6 factors):
brackets <- round(seq(15, 85, length.out = 12))
brlabels <- character()
for (i in 1:11) {
  brlabels[i] <- paste(brackets[i], "to", brackets[i + 1], sep = " ")
}
AB <- cut(round(runif(100, 18, 80)), breaks = brackets, labels = brlabels)
sex <- factor(sample(c(1,2), 100, replace = TRUE), levels = c(1,2), labels = c("Male", "Female"))
eth <- factor(sample(c(1:6), 100, replace = TRUE), levels = c(1:6), labels = c("French", "German", "Swedish", "Polish", "Greek", "Italian"))
somerandombusiness <- rnorm(100, 50, 4)
sim.df <- data.frame(somerandombusiness)
sim.df$AB <- AB
sim.df$sex <- sex
sim.df$eth <- eth
It may be more cumbersome to select a specific intersection of the three at first, but most of the tools to deal with factor variables expect a dataframe.
Edit 1
You could do something like:
runif(1,0) >= (sim.df[which(sim.df$AB=="34 to 40"&sim.df$sex=="Male"&sim.df$eth=="German"), 1])
But I'm still not sure why you would want to. For one, with my method there is no way to be sure that all possible combinations are enumerated. You could up the sample size to a few thousand without much trouble, but that would only make it very likely that every combination exists. In this case I've chosen one that does exist.
You could do this more easily with something like table(sim.df$eth, sim.df[, 1] > 60), which will give a cross-tab of the somerandombusiness values > 60 against the various ethnicities.
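Another possible route back to your original array idea (a sketch, assuming agebracket, sex and eth are equal-length numeric vectors and Darray is the prevalence array from your question): with vectors of indices, Darray[agebracket, sex, eth] selects a whole sub-array, whereas matrix indexing with cbind() looks up one cell per individual.
idx  <- cbind(agebracket, sex, eth)      # one row of indices per individual
prev <- Darray[idx]                      # element-wise lookup, one value each
diab <- rbinom(length(prev), 1, prev)    # Bernoulli trial per individual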