Referencing an array value from within a function in R - arrays

I imagine I am missing something quite simple here, or barking up the wrong tree completely, but I have been trying to sort this out for a number of days and my novice R skills haven't been able to crack it.
I am looking for a way to reference an array of values from within an R function. I am creating a simulated population: I have each individual's age, sex and ethnicity, and I want to simulate the presence or absence of diabetes. I have the prevalence of diabetes by age bracket, gender and ethnicity, which I have made into a 2 (gender) x 11 (age bracket) x 6 (ethnicity) array. What I want to do is reference the correct cell within the array and use that with a runif call to run a Bernoulli trial per individual.
The code below is the current version; however, I have tried a number of different methods with varying results:
function(AB,sex,eth){
AB<-AB
sex<- sex
eth<-as.numeric(eth)
#make matrix reference
#make 'european' equal to 'other'
eth <- ifelse(eth==7,6,eth)
#change male from a 0 coding to a 2 for array lookup
sex <- ifelse(sex==1,1,2)
#remove seven from AB due to diab data starting at 30-34 age bracket
agebracket <- AB-7
#random number drawn
diabbase <- runif(census$Total.Sex[AB],0,1)
#census$total.sex gives the total number in each age bracket
#array assignment
arrayvalue <- Darray[agebracket,sex,eth]
diab <- ifelse((diabbase >= (Darray[agebracket,sex,eth])) ,1,0)
return(diab)
}
If I call the function from the command line with "arrayvalue" returned rather than "diab", and with individual values submitted rather than variables (i.e. diabtest <- diabgen(10,1,1)), it returns the correct value from the array. If I submit the variables (i.e. diabtest <- diabgen(AB,sex,eth)), it returns an empty array.
If further information would make this clearer, please let me know and I will happily provide it. It seems so easy, but it is doing my head in. I am open to suggestions on other or better ways of doing the same thing; any hints appreciated.

This may not solve your problem (I'll update as needed), but here is a simple simulated data frame for your conditions (2x11x6 factors):
brackets <- round(seq(15, 85, length.out = 12))
brlabels <- character()
for (i in 1:11) {
brlabels[i] <- paste(brackets[i], "to", brackets[i + 1], sep = " ")
}
AB <- cut(round(runif(100, 18, 80)), breaks = brackets, labels = brlabels)
sex <- factor(sample(c(1,2), 100, replace = TRUE), levels = c(1,2), labels = c("Male", "Female"))
eth <- factor(sample(c(1:6), 100, replace = TRUE), levels = c(1:6), labels = c("French", "German", "Swedish", "Polish", "Greek", "Italian"))
somerandombusiness <- rnorm(100, 50, 4)
sim.df <- data.frame(somerandombusiness)
sim.df$AB <- AB
sim.df$sex <- sex
sim.df$eth <- eth
It may be more cumbersome to select a specific intersection of the three at first, but most of the tools to deal with factor variables expect a dataframe.
Edit 1
You could do something like:
runif(1,0) >= (sim.df[which(sim.df$AB=="34 to 40"&sim.df$sex=="Male"&sim.df$eth=="German"), 1])
But I'm still not sure why you would want to. For one, with my method there is no way to be sure that all possible combinations are enumerated. You could raise the sample size to a few thousand without much trouble, but that would only make it very likely, not certain, that every combination exists. In this case I've chosen one that does exist.
You could do this more easily with something like table(sim.df$eth, sim.df[, 1] > 60), which cross-tabulates ethnicity against whether somerandombusiness exceeds 60.
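For the array lookup described in the question, one option is matrix indexing: indexing an array with a matrix that has one column per dimension returns one cell per row, so a whole vector of individuals can be looked up at once. The following is a minimal sketch, not the asker's actual data: the prevalence values in Darray are made up, the individuals are assumed to be already coded as integer indices, and the index order has to match the dimension order of the array (sex, age bracket, ethnicity for a 2x11x6 array).

```r
# Hypothetical prevalence array: 2 (sex) x 11 (age bracket) x 6 (ethnicity).
# The values are random placeholders, for illustration only.
set.seed(1)
Darray <- array(runif(2 * 11 * 6, 0, 0.3), dim = c(2, 11, 6))

# Hypothetical individuals, already coded as integer indices per dimension.
n <- 1000
sex        <- sample(1:2,  n, replace = TRUE)
agebracket <- sample(1:11, n, replace = TRUE)
eth        <- sample(1:6,  n, replace = TRUE)

# A three-column index matrix selects one cell per row (one per individual).
p <- Darray[cbind(sex, agebracket, eth)]

# One Bernoulli trial per individual: 1 with probability p, 0 otherwise.
diab <- as.integer(runif(n) < p)
```

Note that the comparison is runif(n) < p, so that an individual is assigned diabetes with probability equal to the prevalence in their cell.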

Related

MatchIt - how to make matching date specific?

I'm trying to use MatchIt to create two sets of matched investment companies (treatment vs control).
I need to match the treatment companies to the control companies using only data from the 1-3 years preceding the treatment.
For example, if a company received treatment in 2009, then I would want to match it using data from 2009, 2008 and 2007. (My after-treatment effects dummy would hold a value from 2010 onwards in this case.)
I am unsure how to add this selection into my matching code, which currently looks like this:
matchit(signatory ~ totalUSD + brownUSD + country + strategy, data = panel6, method = "full")
Should I consider using the 'after' treatments effects dummy in some way?
Any tips for how I add this in would be greatly appreciated!
There is no straightforward way to do this in MatchIt. You can set a caliper, which requires the control companies to be within a certain number of years from a treated company, but there isn't a way to require that control companies have a year strictly before the treated company. You can perform exact matching on year so that the treated and control companies have exactly the same year using the exact argument.
Another, slightly more involved way is to construct a distance matrix yourself and set to Inf any distances between units that are forbidden to match with each other. The first step is to estimate a propensity score, which you can do manually or using matchit(). Then you construct a distance matrix and, for each entry, decide whether to set the distance to Inf. Finally, you supply the distance matrix to the distance argument of matchit(). Here's how you would do that:
#Estimate the propensity score
ps <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
data = panel6, method = NULL)$distance
#Create the distance matrix
dist <- optmatch::match_on(signatory ~ ps, data = panel6)
#Loop through the matrix and set disallowed matches to Inf
t <- which(panel6$signatory == 1)
u <- which(panel6$signatory != 1)
for (i in seq_along(t)) {
for (j in seq_along(u)) {
if (panel6$year[u[j]] > panel6$year[t[i]] || panel6$year[u[j]] < panel6$year[t[i]] - 2)
dist[i,j] <- Inf
}
}
#Note: can be vectorized for speed but shouldn't take long regardless
#Supply the distance matrix to matchit() and match
m <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
data = panel6, method = "full", distance = dist)
That should work. You can verify by looking at individual groups of matched companies using match.data():
md <- match.data(m, data = panel6)
md <- md[with(md, order(subclass, signatory)),]
View(md) #assuming you're using RStudio
You should see that within subclasses, the control units are 0-2 years below the treated units.
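The nested loop that sets disallowed matches to Inf can be vectorised with outer(). This is a minimal sketch on made-up data: the year and signatory vectors are hypothetical, and a plain numeric matrix stands in for the match_on() result (an assumption; the real object may need converting first).

```r
# Hypothetical toy data: 2 treated units and 4 controls.
year      <- c(2009, 2008, 2010, 2007, 2006, 2009)
signatory <- c(1,    0,    1,    0,    0,    0)

t <- which(signatory == 1)  # treated rows
u <- which(signatory != 1)  # control rows

# Placeholder distances (treated in rows, controls in columns).
set.seed(1)
dist <- matrix(runif(length(t) * length(u)), nrow = length(t))

# gap[i, j] = year of treated unit i minus year of control unit j.
gap <- outer(year[t], year[u], "-")

# Disallow controls observed after the treated unit or more than 2 years before it.
dist[gap < 0 | gap > 2] <- Inf
```

The condition gap < 0 | gap > 2 is the same test as in the loop, applied to every treated/control pair in one step.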

Double nested loop with embedded "if/else" statement to iterate over multidimensional array in R

I have a 4-dimensional array (an ensemble of climate models with dimensions lon, lat, time, models). I want to iterate over models (4th dimension) and apply a function to each decade, i.e. each 10-year chunk along the 3rd dimension. The time series in my real data is 27 years long, so I introduced an "if" statement for the last decade, which does not contain a complete 10 years. However, when I verify the results manually, I get slightly different numbers. This does not happen when I remove the if statement or use a time series of exactly 20 years. It seems to me that the "if" statement somehow also changes the indices and counters for the full decades. Can someone please shed light on what is happening? I hope the corrected code may be useful to people working with multidimensional arrays in general. I prepared simplified code, which also generates the data and reproduces this behaviour:
array1<- array(1:120, dim=c(3,3,17))
array2<- array(500:620, dim=c(3,3,17))
array3<- array(1000:1120, dim=c(3,3,17))
# create an empty array
nlon<-3
nlat<-3
nt<-17
last_decade_length <- 7
play_array<-matrix(0,nlon*nlat*nt*3)
dim(play_array)<-c(nlon,nlat,nt,3)
#Allocate DATA of array1 (model one)
play_array[,,,1]<-array1
#Allocate DATA of array2 (model two)
play_array[,,,2]<-array2
#Allocate DATA of array3 (model three)
play_array[,,,3]<-array3
# create an object to hold final result
mean_decade_object<-matrix(0,2*3)
dim(mean_decade_object)<-c(2,3)
### NESTED DOUBLE LOOP - "a" is the index to iterate over decades; "b" is the index to iterate over arrays/models
a<-1
b<-1
for (array in 1:dim(play_array)[4]) {
print(paste("array",array))
for (decade in seq(1,dim(play_array)[3], 10)){
if(length(decade)==10){
mean_decade_object[a,b] <-mean(play_array[,,decade:(decade+9), array], na.rm = T)
print(mean(play_array[,,decade:(decade+9), array], na.rm = T))
print(paste("decade",decade))
a <-a+1
}
else{
mean_decade_object[a,b] <-mean(play_array[,,decade:(decade+last_decade_length-1), array], na.rm = T)
print(mean(play_array[,,decade:(decade+last_decade_length-1), array], na.rm = T))
print(paste("decade",decade))
a <-a+1
}
}
a<-1
b<-b+1
}
print(mean_decade_object)
##############
#VERIFICATION#
##############
sub1 <- play_array[,,,1]
mean(sub1[,,1:10], na.rm=T)
Why does the verification result (mean of the 1st decade of the 1st model) not match the top-left cell of the object produced by the loop? Any ideas?
I tried to fix your existing code, to the best of my abilities, without changing too much.
The main issue is that the condition in the if statement is never met. You are looping over seq(1,dim(play_array)[3], 10), which in this case is 1, 11, and inside the loop decade is a single number, so length(decade) is always 1. The code therefore jumps to the else branch every time and calculates the mean over seven years. You can check this by changing the number of years in your verification code to 7:
sub1 <- play_array[,,,1]
mean(sub1[,,1:7], na.rm=T)
Here is the corrected code, though I believe a more efficient, or rather more elegant, method can be found to achieve the same goal.
mean_decade_object <- matrix(0, ncol = 3, nrow = 2)
for (array in 1:dim(play_array)[4]) {
for (decade in 1:floor(dim(play_array)[3]/10)){
decade_start <- decade*10 - 9
decade_end <- decade*10
mean_decade_object[decade, array] <- mean(play_array[,,decade_start:decade_end, array],
na.rm = T)
print(mean_decade_object[decade, array])
print(paste("decade",decade))
}
if (dim(play_array)[3] %% 10 != 0){
# dim(play_array)[3] %% 10 ----> are the remaining years
decade_start <- dim(play_array)[3] - (dim(play_array)[3] %% 10) + 1
decade_end <- dim(play_array)[3]
mean_decade_object[decade+1, array] <- mean(play_array[,,decade_start:decade_end, array],
na.rm = T)
}
}
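As one possible more compact alternative, the time indices can be grouped into decades first and the mean taken per group, so the incomplete last decade needs no special case. This is a minimal sketch under assumptions: the array is filled with consecutive integers rather than the question's data, and the names decade_id and mean_decade are made up for illustration.

```r
# Hypothetical 4-d array: lon x lat x time x model.
nlon <- 3; nlat <- 3; nt <- 17; nmod <- 3
play_array <- array(seq_len(nlon * nlat * nt * nmod),
                    dim = c(nlon, nlat, nt, nmod))

# Decade label for every time step: 1 for steps 1-10, 2 for steps 11-17.
decade_id <- ceiling(seq_len(nt) / 10)

# For each model, average over each group of time indices.
mean_decade <- sapply(seq_len(nmod), function(m) {
  sapply(split(seq_len(nt), decade_id), function(idx) {
    mean(play_array[, , idx, m, drop = FALSE], na.rm = TRUE)
  })
})

# Rows are decades, columns are models.
dim(mean_decade)
```

With a time series shorter than 11 steps there is only one decade and sapply() returns a plain vector rather than a matrix, so that case would need handling separately.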

Creating a customisable n dimension array

This is two questions in one; if I should be splitting them, please let me know.
I have a spreadsheet of HR data and I'm going to be cutting it into various cross sections. Each row currently represents an employee in the year of a particular report (so, for example, over a three-year period an employee would appear three times, and a column records which year that row refers to), along with a series of other characteristics. I've also added a field showing how many FTEs that employee represents for that period, which represents that employee's exposure to risk.
What I'm trying to do, for the sake of marrying it up with other data, is create an n dimensional array where each point represents the total exposure to risk that matches the dimensions. In the example I'm using, the dimensions are Year, Company [there are a couple], Age Band, Gender, Division, Tenure band.
To do so, among other code, I've written the following:
FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
FactorDim <- lapply(length,mapply(unique,HR[FactorNames]))
Names <- lapply(HR[FactorNames], function(x)sort(unique(x)))
Index <- 1
for (Ten in 1:FactorDim[6]){
for (Job in 1:FactorDim[5]) {
for (Sex in 1:FactorDim[4]) {
for (Age in 1:FactorDim[3]) {
for (Co in 1:FactorDim[2]) {
for (Year in 1:FactorDim[1]) {
ExpList[Index] = sum(subset(HR,
HR$FY == Names[1,Year],
HR$`HR Business` == Names[2, Co],
HR$`Age Band` == Names[3, Age],
HR$Gender == Names[4, Sex],
HR$`Classification Level 1` == Names[5,Job],
HR$`Tenure Band` == Names[6,Ten],
select=Exposure),
na.rm=TRUE)
Index <- Index + 1
}
}
}
}
}
}
There are two main issues.
Names <- lapply(HR[FactorNames], function(x)sort(unique(x))) is incorrect as lapply(HR[FactorNames], function(x)sort(unique(x))) returns the unique values as a single combined element rather than as a vector. This means that the contents for my for loops throw the error Error in Names[1, Year] : incorrect number of dimensions.
There's no way that my concentric for loops are even close to being the optimal way to fill my array and I was wondering if anyone knew what was.
What would you recommend?
I made up some data
# make fake data
FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
d <- as.data.frame(lapply(FactorNames,function(x){paste(x,sample(1:3,6,replace=T))}))
names(d) <- FactorNames
d$Name <- c('z','y','x','w','v','z')
d$Exposure <- randu[1:6,1]
From what I understand, your for loops are intended to generate something like the d$sum_val column below: the sum of all Exposure values for each combination of name and all factors.
# get sum
library(dplyr) # %>% pipe, group_by, and summarize
d %>%
group_by(Name, FY, `HR Business`, `Age Band`, Gender, `Classification Level 1`, `Tenure Band`) %>%
summarize(sum_val = sum(Exposure))
To make an n-dimensional array instead, look at reshape2::acast with a formula like factor1 ~ factor2 ~ factor3, with one ~ between each pair of dimensions.
# lazy way to write out each of the factors
quoteFN <- lapply(c('Name',FactorNames),sprintf,fmt='`%s`')
concatFN <- paste(collapse=" ~ ", quoteFN )
# collapse into array
out <- reshape2::acast(d, as.formula(concatFN),value.var='Exposure',sum)
# what does it look like
dimnames(out)
dim(out)
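A base R alternative is tapply(), which returns an array with one dimension per grouping factor when given a list (or data frame) of factors. This is a minimal sketch on a made-up three-factor table; the HR data frame, its values and the name factor_names are hypothetical stand-ins for the real data.

```r
# Hypothetical HR extract: one row per employee-year.
HR <- data.frame(
  FY         = c(2015, 2015, 2016, 2016, 2016),
  Gender     = c("F", "M", "F", "F", "M"),
  `Age Band` = c("20-29", "30-39", "20-29", "20-29", "30-39"),
  Exposure   = c(1, 0.5, 1, 0.8, 1),
  check.names = FALSE  # keep the space in "Age Band"
)

factor_names <- c("FY", "Gender", "Age Band")

# One array dimension per factor; each cell holds the summed Exposure.
exposure_array <- tapply(HR$Exposure, HR[factor_names], sum)

# Combinations with no rows come back as NA; treat them as zero exposure.
exposure_array[is.na(exposure_array)] <- 0

dim(exposure_array)
exposure_array["2016", "F", "20-29"]
```

Cells can then be read by level name, as in the last line, which avoids tracking a running index across nested loops.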

looping cor.test on split data

My small challenge is writing a loop over a data frame that is split by group, so that I can run a correlation for each group.
Here is an example of what I need to achieve for each species:
rbt<-subset(Trjan,Trjan$Spp=="Redbilled Teal")
cotest<-cor.test(rbt$year,rbt$abundance)
vals<-c(cotest$estimate,cotest$p.value)
vals# at the end of the day I need a dataframe with species, slope & p value e.g. "Redbilled Teal" "its slope" "p value"
But because I have many species I can't do this by hand for all of them. After following some examples I got this code, but I am failing to reference my variables properly.
uniq <- unique(unlist(Trjan$Spp))
for (i in 1:length(uniq)){
data_1 <- subset(Trjan, Spp == uniq[i])
cor.test(year,abundance)
vals<-c(estimate,p.value)
}
# error "abundance not found
Any help would be appreciated. I thought my small problem would not need a sample of data; if the need arises I can edit.
I finally got help from a friend. I realised that I needed to create a new, empty data frame to store all my cor.test results by species:
final.tab<-data.frame(Species=character(),cor_est=numeric(),cor_pval=numeric(),stringsAsFactors = F)
uniq <- unique(unlist(Trjan$Spp))
for (i in 1:length(uniq)){
data_1 <- subset(Trjan, Spp == uniq[i])
#I had to create an object to store your cor.test results and add the object name (i.e. "data_1$" before your column name)
cor.test.temp<-cor.test(data_1$year,data_1$abundance)
vals<-c(as.character(uniq[i]),round(as.numeric(cor.test.temp$estimate),3),round(as.numeric(cor.test.temp$p.value),3))
#progressively filling in my data.frame with cor.test results
final.tab[i,]<-vals
}
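The same result can also be built without growing a data frame row by row, by splitting the data by species and binding the per-species results together. This is a minimal sketch: the Trjan data below is made up, and the column names follow the answer above. Note that cor.test() reports a correlation coefficient, not a regression slope.

```r
# Hypothetical stand-in for the Trjan data frame.
Trjan <- data.frame(
  Spp       = rep(c("Redbilled Teal", "Cape Shoveler"), each = 6),
  year      = rep(2001:2006, times = 2),
  abundance = c(10, 12, 15, 14, 18, 21, 30, 27, 25, 26, 20, 18)
)

# Run cor.test() once per species and return a one-row data frame each time.
results <- do.call(rbind, lapply(split(Trjan, Trjan$Spp), function(d) {
  ct <- cor.test(d$year, d$abundance)
  data.frame(Species  = as.character(d$Spp[1]),
             cor_est  = round(unname(ct$estimate), 3),
             cor_pval = round(ct$p.value, 3))
}))
rownames(results) <- NULL

results
```

Because each piece is a data frame, cor_est and cor_pval stay numeric instead of being coerced to character.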

MATLAB solve array

I've got multiple arrays that you can't quite fit a curve/equation to, but I do need to solve them for a lot of values. Simplified, it looks like this when I plot it, but the real ones have a lot more points:
So say I would like to solve for y=22, how would I do that? As you can see, there would be three solutions to this, but I only need the leftmost one.
Linear is okay, but I'd rather use a non-linear method.
The only way I have found is to fit an equation to a set of points and solve that equation, but an equation can't approximate the array accurately enough.
This implementation uses first-order (linear) interpolation. If you're looking for higher accuracy and it feels appropriate, you can use a similar strategy with a higher-order estimator.
It assumes that data is the name of your array, with x values in the first column and y values in the second, that the rows are sorted by increasing or decreasing x, and that you want to find all x values at which y = 22:
searchPoint = 22; %search for all solutions where y = 22
matchPoints = []; %x values at which y equals searchPoint
for ii = 1:size(data,1)-1
if (data(ii,2)>searchPoint)&&(data(ii+1,2)<searchPoint)
xMatch = data(ii,1)+(searchPoint-data(ii,2))*(data(ii+1,1)-data(ii,1))/(data(ii+1,2)-data(ii,2)); %Linear interpolation to solve for xMatch
matchPoints = [matchPoints xMatch];
elseif (data(ii,2)<searchPoint)&&(data(ii+1,2)>searchPoint)
xMatch = data(ii,1)+(searchPoint-data(ii,2))*(data(ii+1,1)-data(ii,1))/(data(ii+1,2)-data(ii,2)); %Linear interpolation to solve for xMatch
matchPoints = [matchPoints xMatch];
elseif (data(ii,2)==searchPoint) %check if data(ii,2) is equal
matchPoints = [matchPoints data(ii,1)];
end
end
if(data(end,2)==searchPoint) %The loop stops one row before the end, so check the last row separately
matchPoints = [matchPoints data(end,1)];
end
This was written without running it in MATLAB, but the logic was tested in Octave (in other words, sorry if there's a slight typo in the variable names, but the math should be correct).
