For a given data frame, I would like to multiply a column of the data frame by values taken from an array. The data frame consists of rows containing a name, a numerical value, and two factor values:
name credit gender group
n1 10 m A
n2 20 f B
n3 30 m A
n4 40 m B
n5 50 f C
This data frame can be generated using the commands:
name <- c('n1','n2','n3','n4','n5')
credit <- c(10,20,30,40,50)
gender <- c('m','f','m','m','f')
group <- c('A','B','A','B','C')
DF <-data.frame(cbind(name,credit,gender,group))
# binds columns together and uses it as a data frame
Additionally we have a matrix derived from the data frame (in more complex cases this will be an array). This matrix contains the sum value of all contracts that fall into a particular category (characterized by m/f and A/B/C):
m f
A 40 NA
B 40 20
C NA 50
The goal is to multiply the values in DF$credit by using the corresponding value assigned to each category in the matrix, e.g. the value 10 of the first row in DF would be multiplied by 40 (the category defined by m and A).
The result would look like:
name credit gender group result
n1 10 m A 400
n2 20 f B 400
n3 30 m A 1200
n4 40 m B 1600
n5 50 f C 2500
If possible, I would like to do this using only base R, but I am open to any helpful solution that works nicely.
You can construct a set of indices into derived (your derived matrix) by making an index matrix out of DF$group and DF$gender. The as.character calls are there because DF$group and DF$gender are factors, whereas the index matrix needs character values that match the row and column names of derived.
>idx = matrix( c(as.character(DF$group),as.character(DF$gender)),ncol=2)
>idx
[,1] [,2]
[1,] "A" "m"
[2,] "B" "f"
[3,] "A" "m"
[4,] "B" "m"
[5,] "C" "f"
>DF$result = DF$credit * derived[idx]
Note that with the code you have above to generate DF, your numeric columns turn out as factors (i.e. DF$credit is a factor; in R 4.0 and later it is a character vector instead). In that case you need as.numeric(as.character(DF$credit)) * derived[idx], because calling as.numeric directly on a factor returns the internal level codes rather than the values. However, I imagine that in your actual data DF$credit is already numeric rather than a factor.
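For reference, here is a minimal self-contained sketch of the index-matrix approach. It assumes the derived matrix is built with tapply (the question does not show how it was created), and it builds DF without cbind so that credit stays numeric.

```r
# Build the data frame directly so that credit stays numeric
name   <- c("n1", "n2", "n3", "n4", "n5")
credit <- c(10, 20, 30, 40, 50)
gender <- c("m", "f", "m", "m", "f")
group  <- c("A", "B", "A", "B", "C")
DF <- data.frame(name, credit, gender, group)

# Assumption: the derived matrix holds the credit sum per group/gender cell
# (rows = group, columns = gender, NA where a cell has no contracts)
derived <- tapply(DF$credit, list(DF$group, DF$gender), sum)

# Two-column character matrix: one (row name, column name) pair per row of DF
idx <- cbind(as.character(DF$group), as.character(DF$gender))

# Matrix indexing picks one cell of derived for each row of DF
DF$result <- DF$credit * derived[idx]
DF
```

Indexing by a two-column character matrix relies on derived having row and column names, which tapply provides here.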
When you create the data.frame object, don't use cbind: it's not necessary, and it converts everything to a character matrix first, so the credit variable ends up as a factor (or a character vector in R 4.0 and later).
Just use DF <- data.frame(name, credit, gender, group)
Then run a for loop that goes through each row in your data.frame object.
n <- length(DF$credit)
result <- rep(0, n)
for(i in 1:n) {
result[i] <- DF$credit[i] * sum(DF$credit[DF$gender==DF$gender[i] & DF$group==DF$group[i]])
}
Replace your data.frame object with this new one that includes your results.
DF <- data.frame(name, credit, gender, group, result)
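If you would rather avoid the explicit loop, the same per-category sums can be sketched with base R's ave function. This is an alternative to the loop, not a change to it, and it assumes DF was built without cbind so that credit is numeric.

```r
name   <- c("n1", "n2", "n3", "n4", "n5")
credit <- c(10, 20, 30, 40, 50)
gender <- c("m", "f", "m", "m", "f")
group  <- c("A", "B", "A", "B", "C")
DF <- data.frame(name, credit, gender, group)

# ave() returns, for every row, the sum of credit within that row's
# gender/group combination, so no loop or lookup matrix is needed
category_sum <- ave(DF$credit, DF$gender, DF$group, FUN = sum)
DF$result <- DF$credit * category_sum
DF
```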
I recommend the plyr package, but you can do this using the base by function:
> by(DF, DF['name'], function (row) row$credit * m[as.character(row$group), as.character(row$gender)])
name: n1
[1] 400
---------------------------------------------------------------------
name: n2
[1] 400
---------------------------------------------------------------------
name: n3
[1] 1200
---------------------------------------------------------------------
name: n4
[1] 1600
---------------------------------------------------------------------
name: n5
[1] 2500
plyr can give you the result as a data frame which is nice:
> ddply(DF, .(name), function (row) row$credit * m[as.character(row$group), as.character(row$gender)])
name V1
1 n1 400
2 n2 400
3 n3 1200
4 n4 1600
5 n5 2500
I have two matrices, one of which has i) its columns in a different order and ii) entire columns (every element in the column) with the opposite sign. An example would be
A = 1 2
3 4
b = 1.99 -1.02
3.99 -2.99
How can I re-order b such that it looks like:
b = 1.02 1.99
2.99 3.99
Is there a way to do this quickly in R?
You could treat it as an optimization problem -- minimize the absolute difference between the two matrices by reordering the columns in one of the matrices.
Example data
A <- matrix(c(1, 2, 3, 4), nrow = 2)
A
[,1] [,2]
[1,] 1 3
[2,] 2 4
b <- matrix(c(-2.99, 3.99, -1.02, 1.99), nrow = 2)
b
[,1] [,2]
[1,] -2.99 -1.02
[2,] 3.99 1.99
Optimization / search
# Data frame with a row for every possible column arrangement
ordering <- (expand.grid(rep(list(1:ncol(A)), ncol(A))))
ordering
Var1 Var2
1 1 1
2 2 1
3 1 2
4 2 2
# Create a function to compute the difference for a particular arrangement
loss <- function(i) {
ord <- unlist(ordering[i, ])
sum(abs(abs(A) - abs(b[, ord])))
}
# Find the best arrangement by evaluating every row
# (optimize() is meant for continuous functions, not integer row indices)
losses <- sapply(seq_len(nrow(ordering)), loss)
best <- which.min(losses)
best # row index from the data frame
[1] 2
# Extract the row to get the actual solution
solution <- unname(unlist(ordering[best, ]))
solution
[1] 2 1
Verify
A
[,1] [,2]
[1,] 1 3
[2,] 2 4
b[, solution]
[,1] [,2]
[1,] -1.02 -2.99
[2,] 1.99 3.99
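Note that the search above only reorders the columns; it does not flip the signs. Here is a rough sketch that does both, using the matrices from the question: match each column of A to the closest column of b by absolute value, then flip any column whose sign disagrees with A. The greedy matching is an assumption on my part and only makes sense when every column has one clear best match; the names dist, ord and b_aligned are just for illustration.

```r
# Matrices as written in the question (filled column by column)
A <- matrix(c(1, 3, 2, 4), nrow = 2)
b <- matrix(c(1.99, 3.99, -1.02, -2.99), nrow = 2)

# dist[j, k]: how far column k of b is from column j of A, ignoring signs
dist <- outer(seq_len(ncol(A)), seq_len(ncol(b)),
              Vectorize(function(j, k) sum(abs(abs(A[, j]) - abs(b[, k])))))

# Greedy match: for each column of A take the closest column of b
ord <- apply(dist, 1, which.min)
stopifnot(!anyDuplicated(ord))  # assumption: the matches must be unique

# Flip every matched column that points the opposite way from A
signs <- ifelse(colSums(A * b[, ord]) < 0, -1, 1)
b_aligned <- sweep(b[, ord], 2, signs, `*`)
b_aligned
```

For larger matrices where the greedy match can produce duplicates, a proper assignment solver would be the safer choice.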
Assuming that your matrices are as small as they are in your examples, you could change the order of the columns the following way:
Your example indicates you want to switch the first column and the second column. We can do that by reordering the column indexes like so:
b <- b[ , c(2, 1)]
c(2, 1) indicates that from now on, column 2 will be displayed as the first column and then column 1 will be displayed as the second column. We specify this in the column portion of the index operator and leave the row portion blank.
If we want to change the sign of an entire column, we can perform operations on specific columns like so:
b[ , 1] <- -1*b[ , 1]
This makes it so that every value in what is now the first column gets multiplied by -1.
If the matrix you're dealing with is much bigger, this is probably an impractical approach.
I am attempting to run a Mantel-Haenszel analysis in R to determine whether or not a comparison of proportions test is still significant when accounting for a 'diagnosis' ratio within groups. This test is available in the stats package.
library(stats)
mantelhaen.test(x)
Having done some reading, I've found that this test can perform an odds ratio test on a contingency table that is n x n x k, as opposed to simply n x n. However, I am having trouble arranging my data in the proper way, as I am fairly new to R. I have created some example data...
ex.label <- c("A","A","A","A","A","A","A","B","B","B")
ex.status <- c("+","+","-","+","-","-","-","+","+","-")
ex.diag <- c("X","X","Z","Y","Y","Y","X","Y","Z","Z")
ex.data <- data.frame(ex.label,ex.diag,ex.status)
Which looks like this...
ex.label ex.diag ex.status
1 A X +
2 A X +
3 A Z -
4 A Y +
5 A Y -
6 A Y -
7 A X -
8 B Y +
9 B Z +
10 B Z -
I was originally able to use a simple N-1 chi-square test to compare the proportions of + to - between A and B only. Basically, I want to compare the significance of the ratio in each column, but now I also want to account for ex.diag.
I tried to use the ftable() function to arrange my data in a way that would work.
ex.ftable <- ftable(ex.data)
Which looks like this...
                 ex.status - +
ex.label ex.diag
A        X                 1 2
         Y                 2 1
         Z                 1 0
B        X                 0 0
         Y                 0 1
         Z                 1 1
However, when I run mantelhaen.test(ex.ftable), I get the error 'x' must be a 3-dimensional array. How can I arrange my data in such a way that I can actually run this test?
In mantelhaen.test, the last dimension of the 3-dimensional contingency table x needs to be the stratification variable (ex.diag). This array can be generated as follows:
ex.label <- c("A","A","A","A","A","A","A","B","B","B")
ex.status <- c("+","+","-","+","-","-","-","+","+","-")
ex.diag <- c("X","X","Z","Y","Y","Y","X","Y","Z","Z")
# Now ex.diag is in the first column
ex.data <- data.frame(ex.diag, ex.label, ex.status)
# The flat table
( ex.ftable <- ftable(ex.data) )
# ex.status - +
# ex.diag ex.label
# X A 1 2
# B 0 0
# Y A 2 1
# B 0 1
# Z A 1 0
# B 1 1
The 3D array can be generated using aperm.
# Transform the ftable into a 2 x 2 x 3 array
# First dimension: ex.label
# Second dimension: ex.status
# Third dimension: ex.diag
( mtx3D <- aperm(array(t(as.matrix(ex.ftable)),c(2,2,3)),c(2,1,3)) )
# , , 1
#
# [,1] [,2]
# [1,] 1 2
# [2,] 0 0
#
# , , 2
#
# [,1] [,2]
# [1,] 2 1
# [2,] 0 1
#
# , , 3
#
# [,1] [,2]
# [1,] 1 0
# [2,] 1 1
Now the Cochran-Mantel-Haenszel chi-squared test can be performed.
# Cochran-Mantel-Haenszel chi-squared test of the null that
# two nominal variables are conditionally independent in each stratum
#
mantelhaen.test(mtx3D, exact=FALSE)
The result of the test is
Mantel-Haenszel chi-squared test with continuity correction
data: mtx3D
Mantel-Haenszel X-squared = 0.23529, df = 1, p-value = 0.6276
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
NaN NaN
sample estimates:
common odds ratio
Inf
Given the low number of cases, it is preferable to compute an exact conditional test (option exact=TRUE).
mantelhaen.test(mtx3D, exact=TRUE)
# Exact conditional test of independence in 2 x 2 x k tables
#
# data: mtx3D
# S = 4, p-value = 0.5
# alternative hypothesis: true common odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.1340796 Inf
# sample estimates:
# common odds ratio
# Inf
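As a possibly simpler route to the same 3-dimensional array, table() can cross-tabulate the three vectors directly, with the stratification variable given last. This is a sketch of an alternative to the ftable/aperm steps, not a replacement for them; tab3D and res are names chosen here for illustration.

```r
ex.label  <- c("A","A","A","A","A","A","A","B","B","B")
ex.status <- c("+","+","-","+","-","-","-","+","+","-")
ex.diag   <- c("X","X","Z","Y","Y","Y","X","Y","Z","Z")

# Dimensions follow the argument order: label x status x diag,
# so the stratification variable ex.diag ends up last
tab3D <- table(ex.label, ex.status, ex.diag)
dim(tab3D)

# Same exact conditional test as above, run on the table object
res <- mantelhaen.test(tab3D, exact = TRUE)
res
```

Unlike the aperm result, this array keeps its dimension names, so cells can be checked by name, e.g. tab3D["A", "+", "X"].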
I am new to R. I have an R array (or at least I think it is one) which gives the following output
> head(x)
[,1] [,2]
199 3.40 3.50
What does the 199 at the front mean? How do I extract the elements of this array?
The 199 is a row name. You extract elements from a data frame (or matrix) in R by using square brackets:
x[,1] # gives a column
x[1,] # gives a row
x[1:2,] # gives several rows
You can also use column names like so:
x <- data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
x$col1 # 1 2 3
You'll figure out more as you start to play around in R and do some R tutorials.
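To see the row name in action, here is a small sketch. The matrix x is a hypothetical reconstruction of the one-row object printed in the question; the real object may well have more rows.

```r
# Hypothetical stand-in for the object in the question:
# one row named "199", two unnamed columns
x <- matrix(c(3.40, 3.50), nrow = 1, dimnames = list("199", NULL))

rownames(x)          # the "199" shown at the front of the printed row
first  <- x[1, 1]    # extract by position
second <- x["199", 2]  # extract by row name and column position
```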
A normal matrix is 2-dimensional. But I can initialise:
a<-array(0,dim=c(2,3,4,5))
which is a 2*3*4*5 array.
The command
apply(a,c(2,3),sum)
will give a 3*4 array containing the sums over the elements in the 1st and 4th dimensions.
Why is that? As far as I know, in the apply function, 1 indicates rows and 2 indicates columns, but what does 3 mean here?
We need some abstraction here.
The easiest way to understand apply on an array is to try some examples. Here's some data modified from the last example object in the documentation:
> z <- array(1:24, dim = 2:4)
> dim(z)
[1] 2 3 4
> apply(z, 1, function(x) sum(x))
[1] 144 156
> apply(z, 2, function(x) sum(x))
[1] 84 100 116
> apply(z, 3, function(x) sum(x))
[1] 21 57 93 129
What's going on here? Well, we create a three-dimensional array z. If you use apply with MARGIN=1 you get row sums (two values because there are two rows), if you use MARGIN=2 you get column sums (three values because there are three columns), and if you use MARGIN=3 you get sums across the array's third dimension (four values because there are four levels to the third dimension of the array).
If you specify a vector for MARGIN, like c(2,3), you get the sum over the rows for each combination of column and level of the third dimension. Note how, in the above examples, the results from apply with MARGIN=2 are the row sums and the results with MARGIN=3 are the column sums of the matrix seen in the result below:
> apply(z, c(2,3), function(x) sum(x))
[,1] [,2] [,3] [,4]
[1,] 3 15 27 39
[2,] 7 19 31 43
[3,] 11 23 35 47
If you specify all of the dimensions as MARGIN=c(1,2,3) you simply get the original three-dimensional object:
> all.equal(z, apply(z, c(1,2,3), function(x) sum(x)))
[1] TRUE
Best way to learn here is just to start playing around with some real matrices. Your example data aren't helpful for looking at sums because all of the array entries are zero.
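Applied to the four-dimensional case from the question, a quick sketch looks like this. It fills the array with 1, 2, 3, ... instead of zeros so the sums are visible; that change is mine and not part of the question.

```r
# Same shape as in the question, but with non-zero entries
a <- array(seq_len(2 * 3 * 4 * 5), dim = c(2, 3, 4, 5))

# MARGIN = c(2, 3) keeps dimensions 2 and 3 and sums over dimensions 1 and 4
s <- apply(a, c(2, 3), sum)
dim(s)

# Each cell of s is the sum over everything sharing that 2nd and 3rd index
manual <- sum(a[, 2, 3, ])
s[2, 3] == manual
```

So the numbers in MARGIN are simply positions in dim(a): the dimensions listed are kept, and all the others are collapsed by the function.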
My array is
x <- array(1:24, dim=c(3,4,3))
My task 1 is to find the max value according to the first two dimensions
x.max <- apply(x,c(1,2), function(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)))
This handles the case where there is NA data.
My task 2 is to find the position of the max value along the third dimension.
I tried
x.max.position = apply(x, c(1,2),which.max(x))
But this only gives me the position on the first two dimensions.
Can anyone help me?
It's not totally clear, but if you want to find the max for each matrix of the third dimension (is that even a technically right thing to say?), then you need to use apply across the third dimension. The description of the MARGIN argument under ?apply states:
a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.
So for this example where you have a 3D array, 3 is the third dimension. So...
t( apply( x , 3 , function(x) which( x == max(x) , arr.ind = TRUE ) ) )
[,1] [,2]
[1,] 3 4
[2,] 3 4
[3,] 3 4
Which returns a matrix where each row contains the row and then column index of the max value of each 2D array/matrix of the third dimension.
If you want the max across all dimensions you can use which and the arr.ind argument like this:
which( x==max(x,na.rm=T) , arr.ind = T )
dim1 dim2 dim3
[1,] 3 4 2
Which tells us the max value is the third row, fourth column, second matrix.
EDIT
To find the position along dim 3 where the values on dims 1 and 2 are at their maximum, try:
which.max( apply( x , 3 , max ) )
# [1] 2
Which tells us that position 2 of the third dimension contains the maximal value.
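If what you are after is instead the position along the third dimension for every (row, column) cell, which is how I read task 2, one possible sketch is to pass which.max as a function to apply over the first two dimensions. The helper which_max_or_na is a name made up here; it returns NA for cells that are entirely NA, mirroring the ifelse in your task 1.

```r
x <- array(1:24, dim = c(3, 4, 3))

# which.max skips NAs but returns integer(0) for an all-NA vector,
# so guard that case to keep the result a plain matrix
which_max_or_na <- function(v) {
  if (all(is.na(v))) NA_integer_ else which.max(v)
}

# One value per (row, column) cell: the index along dim 3 holding the max
x.max.position <- apply(x, c(1, 2), which_max_or_na)
x.max.position
```

Note that the original attempt passed which.max(x), i.e. the result of a call, where apply expects the function itself.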