MatchIt - how to make matching date specific?

I'm trying to use MatchIt to create two sets of matched investment companies (treatment vs control).
I need to match the treatment companies to the control companies using only data from the 1-3 years preceding the treatment.
For example, if a company received treatment in 2009, I would want to match it using data from 2009, 2008 and 2007 (my after-treatment-effects dummy would hold a value from 2010 onwards in this case).
I am unsure how to add this selection into my matching code, which currently looks like this:
matchit(signatory ~ totalUSD + brownUSD + country + strategy, data = panel6, method = "full")
Should I consider using the 'after' treatment-effects dummy in some way?
Any tips for how I add this in would be greatly appreciated!

There is no straightforward way to do this in MatchIt. You can set a caliper, which requires the control companies to be within a certain number of years of a treated company, but there isn't a way to require that control companies have a year strictly before the treated company. Using the exact argument, you can perform exact matching on year so that the treated and control companies have exactly the same year.
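For example, exact matching on year on top of full matching on the other covariates might look like the following; this is only a sketch, assuming a recent MatchIt version (where exact accepts a one-sided formula) and that panel6 contains the year column used further below:
m_exact <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
                   data = panel6, method = "full", exact = ~year)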
Another, slightly more involved, way is to construct a distance matrix yourself and set to Inf any distances between units that are forbidden to match with each other. The first step is to estimate a propensity score, which you can do manually or using matchit(). Then you construct a distance matrix and, for each entry, decide whether to set the distance to Inf. Finally, you supply the distance matrix to the distance argument of matchit(). Here's how you would do that:
#Estimate the propensity score
ps <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
              data = panel6, method = NULL)$distance
#Create the distance matrix (rows correspond to treated units, columns to controls)
dist <- optmatch::match_on(signatory ~ ps, data = panel6)
#Loop through the matrix and set disallowed matches to Inf
t <- which(panel6$signatory == 1)
u <- which(panel6$signatory != 1)
for (i in seq_along(t)) {
  for (j in seq_along(u)) {
    if (panel6$year[u[j]] > panel6$year[t[i]] || panel6$year[u[j]] < panel6$year[t[i]] - 2)
      dist[i, j] <- Inf
  }
}
#Note: this can be vectorized for speed, but it shouldn't take long regardless
#Supply the distance matrix to matchit() and match
m <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
             data = panel6, method = "full", distance = dist)
That should work. You can verify by looking at individual groups of matched companies using match.data():
md <- match.data(m, data = panel6)
md <- md[with(md, order(subclass, signatory)),]
View(md) #assuming you're using RStudio
You should see that within subclasses, the control units are 0-2 years below the treated units.
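As noted in the comment above, the double loop can be vectorized. A minimal, untested sketch, assuming dist can be indexed like a regular matrix (as the element-wise assignment above already does):
year_t <- panel6$year[t]
year_u <- panel6$year[u]
bad <- outer(year_t, year_u, function(yt, yu) yu > yt | yu < yt - 2)
dist[bad] <- Inf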

Creating a customisable n dimension array

This is two questions in one; if I should be splitting them, please let me know.
I have a spreadsheet of HR data and I'm going to be cutting it into various cross sections. Each row currently represents an employee in a given report year (so over a three-year period, an employee would appear three times, with a column indicating which year that row refers to), along with a series of other characteristics. Furthermore, I've added a field showing how many FTEs that employee represents for that period, which represents that employee's exposure to risk.
What I'm trying to do, for the sake of marrying it up with other data, is create an n dimensional array where each point represents the total exposure to risk that matches the dimensions. In the example I'm using, the dimensions are Year, Company [there are a couple], Age Band, Gender, Division, Tenure band.
To do so, among other code, I've written the following:
FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
FactorDim <- lapply(length,mapply(unique,HR[FactorNames]))
Names <- lapply(HR[FactorNames], function(x)sort(unique(x)))
Index <- 1
for (Ten in 1:FactorDim[6]) {
  for (Job in 1:FactorDim[5]) {
    for (Sex in 1:FactorDim[4]) {
      for (Age in 1:FactorDim[3]) {
        for (Co in 1:FactorDim[2]) {
          for (Year in 1:FactorDim[1]) {
            ExpList[Index] = sum(subset(HR,
                                        HR$FY == Names[1, Year],
                                        HR$`HR Business` == Names[2, Co],
                                        HR$`Age Band` == Names[3, Age],
                                        HR$Gender == Names[4, Sex],
                                        HR$`Classification Level 1` == Names[5, Job],
                                        HR$`Tenure Band` == Names[6, Ten],
                                        select = Exposure),
                                 na.rm = TRUE)
            Index <- Index + 1
          }
        }
      }
    }
  }
}
There are two main issues.
Names <- lapply(HR[FactorNames], function(x)sort(unique(x))) is incorrect, as lapply() returns the unique values as a list rather than as a matrix I can index into. This means that the lookups inside my for loops throw the error Error in Names[1, Year] : incorrect number of dimensions.
There's no way that my nested for loops are even close to being the optimal way to fill my array, and I was wondering if anyone knew what that would be.
What would you recommend?
I made up some data
# make fake data
FactorNames <- c("FY","HR Business", "Age Band", "Gender", "Classification Level 1", "Tenure Band")
d <- as.data.frame(lapply(FactorNames,function(x){paste(x,sample(1:3,6,replace=T))}))
names(d) <- FactorNames
d$Name <- c('z','y','x','w','v','z')
d$Exposure <- randu[1:6,1]
From what I understand, your for loops intend to generate something like the d$sum_val column below: a sum of all Exposure values for each combination of Name and the factors.
# get sum
library(dplyr) # %>% pipe, group_by, and summarize
d %>%
  group_by(Name, FY, `HR Business`, `Age Band`, Gender, `Classification Level 1`, `Tenure Band`) %>%
  summarize(sum_val = sum(Exposure))
To make an n-dimensional array instead, look to acast from reshape2 with a formula like factor1 ~ factor2 ~ factor3, with a ~ separating each dimension.
# lazy way to write out each of the factors
quoteFN <- lapply(c('Name',FactorNames),sprintf,fmt='`%s`')
concatFN <- paste(collapse=" ~ ", quoteFN )
# collapse into array
out <- reshape2::acast(d, as.formula(concatFN),value.var='Exposure',sum)
# what does it look like
dimnames(out)
dim(out)
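As a quick sanity check on the fake data above, the array total should equal the sum of all exposures, since acast with sum fills empty combinations with 0:
all.equal(sum(out), sum(d$Exposure))  # should be TRUE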

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input into vector semantics methods (Glove, LSA, LDA).
For reference, the Google Books (v2) dataset is formatted as follows (tab-separated)
ngram year match_count volume_count
some word 1999 32 12 # example bigram
However, the problem is, of course, that these data are huge. I will only need a subset of the data from certain decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e., using the trigram corpus). I have a few ideas, but none seem particularly, well, good.
-Idea 1- Initially, my plan was more or less this:
# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
    # (write the same line x times according to the match_count tag)
    remove $file
# tcm construction (using R)
grams <- # read lines from file2 into list
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it <- itoken(grams)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm <- create_tcm(it, vectorizer) # nice and sparse
However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.
-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm and then looping over the whole (year-filtered) ngram dataset and record instances (using the match_count tag) where any two words co-occur to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be superslow), so it seems to be more aimed at rather smaller subsets of data.
edit -Idea 4- I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer with a broken link and a comment about MapReduce (none of which I am familiar with, so I would not know where to start).
But I'm thinking I can't be the first one with the need to tackle such a task, given the popularity of the Ngram dataset, and the popularity of (non-word2vec) distributed semantics methods that operate on a tcm or dtm input; hence ->
...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas or something completely different; R preferred but not necessary)
I will give an idea of how you can do this, but it can be improved in several places. I deliberately wrote it in a "spaghetti style" for better interpretability, but it can be generalized to more than tri-grams.
library(data.table)
library(magrittr)  # for %>%
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = TRUE) %>% simplify2array()
# vocab here is the vocabulary from this chunk, but you may be interested first
# in creating a vocabulary from the whole corpus of ngrams and filtering out
# non-interesting/rare words
vocab = unique(as.character(tokens_matrix))
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)
ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note the 0.5 discount for the more distant word pair - we follow the text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]
# bind by position (the column names intentionally differ across the three tables)
dt = rbindlist(list(dt_12, dt_13, dt_23), use.names = FALSE)
# "reduce" by word indices again - sum pair co-occurrences which appeared in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt,
                           dims = rep(length(vocab), 2), index1 = TRUE,
                           giveCsparse = FALSE, check = FALSE,
                           dimnames = list(vocab, vocab))
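On the toy data above you can eyeball the result to confirm the weighting: "as"/"we" and "we"/"know" should get 32, "i"/"know" and "know"/"you" 54, while the distance-2 pairs "as"/"know" and "i"/"you" get the discounted 16 and 27.
as.matrix(tcm)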

A comparison between two images by histogram

My project involves querying images. I first compare two images by their histograms; if they are alike, the program should report that the pictures are similar. The problem is that it always tells me the two images are not alike.
A=imread('C:\Users\saba\Desktop\images\q4.jpg'); %reading images as arrays into variables 'A' & 'B'
B = imread('C:\Users\saba\Desktop\images\q1.jpg');
j=rgb2gray(A);
i=rgb2gray(B);
subplot(2,2,1);imshow(A);
subplot(2,2,2);imshow(B);
subplot(2,2,3);imshow(j);
subplot(2,2,4);imshow(i);
if histeq(j)==histeq(i)
    disp('The images are same') %output display
else
    disp('the images are not same')
end
In order to compare directly with the == operator, the images would have to be identical. If you want to do this, you could just check whether i==j, provided they are the same size.
As far as I know, there is no built-in function or toolbox which checks whether or not two images are similar. One rough method you could use is seeing how different the sums of the pixel values are for each of the rows and columns:
maxColumnDifference = max(abs(sum(j, 1) - sum(i, 1)));
maxRowDifference = max(abs(sum(j, 2) - sum(i, 2)));
You could then set some tolerance which the sums must be within; it should be a function of the size of the image. To give a standardised answer (0-255) of how different a row or column is, just divide each of the sums by the number of pixels:
maxColumnDifference = max(abs(sum(j, 1)/size(j,1) - sum(i, 1)/size(i,1)));
maxRowDifference = max(abs(sum(j, 2)/size(j,2) - sum(i, 2)/size(i,2)));
You could then determine if they are similar with something like:
tolerance = 50;
if (maxRowDifference < tolerance) && (maxColumnDifference < tolerance)
disp('Images are similarish');
else
disp('Images are not similar enough for this poor tool to recognise');
end
Note that this is all speculation, not tested at all, and there is probably a better way of doing it.

Uniformly sampling on hyperplanes

Given the vector size N, I want to generate a vector <s1, s2, ..., sn> such that s1 + s2 + ... + sn = S.
It is known that 0 < S < 1 and that each si < S. The generated vectors should also be uniformly distributed.
Any code in C that helps explain would be great!
The code here seems to do the trick, though it's rather complex.
I would probably settle for a simpler rejection-based algorithm, namely: pick an orthonormal basis in n-dimensional space starting with the hyperplane's normal vector. Transform each of the points (S,0,0,0..0), (0,S,0,0..0) into that basis and store the minimum and maximum along each of the basis vectors. Sample uniformly each component in the new basis, except for the first one (the normal vector), which is always S, then transform back to the original space and check if the constraints are satisfied. If they are not, sample again.
P.S. I think this is more of a maths question, actually, could be a good idea to ask at http://maths.stackexchange.com or http://stats.stackexchange.com
[I'll skip the "hyper-" prefix for simplicity]
One possible idea: generate many uniformly distributed points in some enclosing volume and project them onto the target part of the plane.
To get a uniform distribution, the volume must be shaped like the part of the plane but with added margins along the plane normal.
To uniformly generate points in such a volume, we can enclose it in a cube and reject everything outside of the volume (a sketch of these steps follows the list):
1. Select a margin; let's take margin M = S for simplicity (as long as the margin is positive it only affects performance).
2. Generate a point in the cube [-M, S+M] x [-M, S+M] x [-M, S+M].
3. If the distance to the plane is more than M, reject the point and go to #2.
4. Project the point onto the plane.
5. Check that the projection falls into [0,S] x [0,S] x [0,S]; if not, reject and go to #2.
6. Add this point to the resulting set and go to #2 if you need more points.
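A minimal R sketch of these steps for the 3-D case (the question asked for C, but the logic carries over; the plane here is x1 + x2 + x3 = S with 0 <= xi <= S, and S and the sample count are arbitrary):
sample_on_plane <- function(nsamples, S, M = S) {
  n <- c(1, 1, 1) / sqrt(3)                # unit normal of the plane
  d0 <- S / sqrt(3)                        # signed distance of the plane from the origin
  out <- matrix(NA_real_, nrow = 0, ncol = 3)
  while (nrow(out) < nsamples) {
    x <- runif(3, min = -M, max = S + M)   # step 2: point in the cube
    dist <- sum(n * x) - d0                # signed distance to the plane
    if (abs(dist) > M) next                # step 3: too far from the plane, reject
    p <- x - dist * n                      # step 4: project onto the plane
    if (any(p < 0) || any(p > S)) next     # step 5: outside [0,S]^3, reject
    out <- rbind(out, p)                   # step 6: keep the point
  }
  out
}
samples <- sample_on_plane(1000, S = 0.7)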
The problem can be mapped to that of sampling on linear polytopes, for which the common approaches are Monte Carlo methods, random walks, and hit-and-run methods (see https://www.jmlr.org/papers/volume19/18-158/18-158.pdf for examples and a short comparison). It is related to linear programming, and can be extended to manifolds.
There is also the analysis of polytopes in compositional data analysis, e.g. https://link.springer.com/content/pdf/10.1023/A:1023818214614.pdf, which provides an invertible transformation between the plane and the polytope that can be used for sampling.
If you are working in low dimensions, you can also use rejection sampling: first sample on the plane containing the polytope (defined by your inequalities), then discard the points that violate the constraints. This latter method is easy to implement (and wasteful, of course); the GNU Octave code below (I leave the re-implementation in C to the author of the question) is an example.
The first requirement is to get a vector orthogonal to the hyperplane. For a sum of N variables this is n = (1, ..., 1). The second requirement is a point on the plane. For your example that could be p = (S, ..., S)/N.
Now any point on the plane satisfies n^T * (x - p) = 0.
We also assume that x_i >= 0.
With these given, you compute an orthonormal basis on the plane (the null space of the vector n) and then create random combinations of that basis. Finally you map back to the original space and apply your constraints to the generated samples.
# Example in 3D
dim = 3;
S = 1;
n = ones(dim, 1); # perpendicular vector
p = S * ones(dim, 1) / dim;
# null-space of the perpendicular vector (transposed, i.e. row vector)
# this generates a basis in the plane
V = null (n.');
# These steps are just to reduce the amount of samples that are rejected
# we build a tight bounding box
bb = S * eye(dim); # each column is a corner of the constrained region
# project on the null-space
w_bb = V \ (bb - repmat(p, 1, dim));
wmin = min (w_bb(:));
wmax = max (w_bb(:));
# random combinations and map back
nsamples = 1e3;
w = wmin + (wmax - wmin) * rand(dim - 1, nsamples);
x = V * w + p;
# mask the points inside the polytope
msk = true(1, nsamples);
for i = 1:dim
  msk &= (x(i,:) >= 0);
endfor
x_in = x(:, msk); # inside the polytope (your samples)
x_out = x(:, !msk); # outside the polytope
# plot the results
scatter3 (x(1,:), x(2,:), x(3,:), 8, double(msk), 'filled');
hold on
plot3(bb(1,:), bb(2,:), bb(3,:), 'xr')
axis image

Referencing an array value from within a function in R

I imagine I am missing something quite simple here, or I am barking up the wrong tree completely, however I have been trying to sort this out over a number of days and my novice R skills haven't been able to crack it.
I am looking for a method to reference an array of values from within an R function. I am creating a simulated population; I have individuals' age, sex and ethnicity, and I want to simulate the presence or absence of diabetes. I have the prevalence of diabetes by age bracket, gender and ethnicity, which I have made into a 2 (gender) x 11 (age bracket) x 6 (ethnicity) array. What I want to do is reference the correct cell within the array and use that with a runif call to run a Bernoulli trial per individual.
The code below is the current version however I have tried a number of different methods with varying results:
diabgen <- function(AB, sex, eth){
  AB <- AB
  sex <- sex
  eth <- as.numeric(eth)
  #make matrix reference
  #make 'european' equal to 'other'
  eth <- ifelse(eth==7, 6, eth)
  #change male from a 0 coding to a 2 for array lookup
  sex <- ifelse(sex==1, 1, 2)
  #remove seven from AB due to diab data starting at 30-34 age bracket
  agebracket <- AB - 7
  #random number drawn
  diabbase <- runif(census$Total.Sex[AB], 0, 1)
  #census$Total.Sex gives the total number in each age bracket
  #array assignment
  arrayvalue <- Darray[agebracket, sex, eth]
  diab <- ifelse((diabbase >= (Darray[agebracket, sex, eth])), 1, 0)
  return(diab)
}
If I call the function from the command line with arrayvalue returned rather than diab, and with individual values submitted rather than variables (i.e. diabtest <- diabgen(10,1,1)), it returns the correct value from the array. But if I submit the variables (i.e. diabtest <- diabgen(AB,sex,eth)) it returns an empty array.
If I can give further info that might make what I am talking about clearer, please let me know; I would be more than happy to do so. It seems so easy, but it is doing my head in. I am open to any suggestions on other/better ways of doing the same thing; any hints appreciated.
This maybe doesn't solve your problem (I'll update as needed), but here is a simple simulated dataframe for your conditions (2x11x6 factors):
brackets <- round(seq(15, 85, length.out = 12))
brlabels <- character()
for (i in 1:11) {
  brlabels[i] <- paste(brackets[i], "to", brackets[i + 1], sep = " ")
}
AB <- cut(round(runif(100, 18, 80)), breaks = brackets, labels = brlabels)
sex <- factor(sample(c(1,2), 100, replace = TRUE), levels = c(1,2), labels = c("Male", "Female"))
eth <- factor(sample(c(1:6), 100, replace = TRUE), levels = c(1:6), labels = c("French", "German", "Swedish", "Polish", "Greek", "Italian"))
somerandombusiness <- rnorm(100, 50, 4)
sim.df <- data.frame(somerandombusiness)
sim.df$AB <- AB
sim.df$sex <- sex
sim.df$eth <- eth
It may be more cumbersome to select a specific intersection of the three at first, but most of the tools to deal with factor variables expect a dataframe.
Edit 1
You could do something like:
runif(1,0) >= (sim.df[which(sim.df$AB=="34 to 40"&sim.df$sex=="Male"&sim.df$eth=="German"), 1])
But I'm still not sure why you would want to. For one, with my method there is no way to be sure that all possible combinations are enumerated. You could up the sample size to a few thousand without much trouble, but that would only make it really, really likely that every combination exists. In this case I've chosen one that does exist.
You could do this more easily with something like table(sim.df$eth, sim.df[, 1] > 60), which will give a cross-tab of the somerandombusiness values > 60 across the various ethnicities.
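As an aside on the original array-lookup idea: base R can index a 3-D array with a matrix of indices, so something along these lines (an untested sketch, assuming the question's Darray and the recoded agebracket/sex/eth vectors) would give one prevalence per individual and one Bernoulli draw each; note that drawing runif(...) < prev gives diabetes with probability prev:
prev <- Darray[cbind(agebracket, sex, eth)]  # one prevalence value per individual
diab <- as.integer(runif(length(prev)) < prev)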
