SSAS (Sexual Segregation and Aggregation Statistic) in R - calling C

I am running the following code, found in the appendix of a paper (https://wiley.figshare.com/articles/Supplement_1_R_code_used_to_format_the_data_and_compute_the_SSAS_/3528698/1), to calculate the Sexual Segregation and Aggregation Statistic in R, but I keep getting the error shown below. Presumably there is an issue with calling a function from C, but I cannot resolve it.
# Main function, computes both the SSAS (Sexual Segregation and
# Aggregation Statistic) and the 95% limits of SSAS
# under the assumption of random association of both sexes
SSAS <- function(x, conf.int = 0.95, B = 10000)
{
x <- as.matrix(x)
nr <- nrow(x)
nc <- ncol(x)
sr <- rowSums(x)
sc <- colSums(x)
n <- sum(x)
E <- outer(sr, sc, "*")/n
dimnames(E) <- dimnames(x)
tmp <- .C("chisqsim", as.integer(nr), as.integer(nc),as.integer(sr),
as.integer(sc), as.integer(n), as.integer(B), as.double(E), integer(nr * nc),
double(n + 1), integer(nc), results = double(B), PACKAGE = "stats")
obs <- sum(sort((x - E)^2/E, decreasing = TRUE))/n
sim <- tmp$results/n
p0 <- (1 - conf.int)/2
return(c(obs, quantile(sim, p0), quantile(sim, 1 -p0)))
}
# This function formats data to be run with the SSAS function
splitmfd <- function(mfd) {
loca1 <- function(x) {
x <- t(x[, 1:2])
dimnames(x) <- list(c("mal", "fem"), as.character(1:ncol(x)))
x
}
l0 <- split(mfd, mfd$mon)
lapply(l0, loca1)
}
# Example 1: Isard
rup <- read.table("http://pbil.univ-lyon1.fr/R/donnees/mfdrupicapra.txt",
h = T)
# Load data from the web
plot1 <- function(w, titre = "") {
plot(1:12, w[, 1], ylim = range(w), ax = F, pch = 19,
type = "n", ylab = "IK", xlab = "")
title(main = titre)
box()
axis(1, 1:12, c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, pretty(range(w)), tck = 1)
polyx <- c(1:12, 12:1)
polyy <- c(w[, 3], rev(w[, 2]))
polygon(polyx, polyy, col = grey(0.9))
points(w[, 1], pch = 19, type = "b")
}
# Function to plot data and 95% levels of significance
l1 <- splitmfd(rup)
# Format data to be used with SSAS function
w <- matrix(unlist(lapply(l1, SSAS)), ncol = 3, byrow = T)
# "w" is a matrix with 3 columns and 12 rows. The first column holds
# the SSAS estimates for each month; columns 2 and 3 hold the lower and
# upper limits, respectively.
plot1(w, "Isard")
# Plot figure 3a
# Example 2: Red deer
cer <- read.table("http://pbil.univ-lyon1.fr/R/donnees/mfdcervus.txt",
h = T)
l1 <- splitmfd(cer)
w <- matrix(unlist(lapply(l1, SSAS)), ncol = 3, byrow = T)
plot1(w, "Red deer")
# Plot figure 3c
# Example 3: Roe deer
cap <- read.table("http://pbil.univ-lyon1.fr/R/donnees/mfdcapreolus.txt",
h = T)
l1 <- splitmfd(cap)
w <- matrix(unlist(lapply(l1, SSAS)), ncol = 3, byrow = T)
plot1(w, "Roe deer")
# Plot figure 3e
Here is my error:
w <- matrix(unlist(lapply(l1, SSAS)), ncol = 3, byrow = T)
Error in .C("chisqsim", as.integer(nr), as.integer(nc), as.integer(sr), :
"chisqsim" not available for .C() for package "stats"

This is a prime example of why you should not call internal C functions from R user code: R internals can (and do) change. The relevant change here is "more use of .Call" by Prof. Ripley, which is why "chisqsim" is no longer available to .C().
Thus, you can change the function to this:
SSAS <- function(x, conf.int = 0.95, B = 10000)
{
x <- as.matrix(x)
nr <- nrow(x)
nc <- ncol(x)
sr <- rowSums(x)
sc <- colSums(x)
n <- sum(x)
E <- outer(sr, sc, "*")/n
dimnames(E) <- dimnames(x)
tmp <- .Call(stats:::C_chisq_sim, sr, sc, B, E)
obs <- sum(sort((x - E)^2/E, decreasing = TRUE))/n
sim <- tmp/n
p0 <- (1 - conf.int)/2
return(c(obs, quantile(sim, p0), quantile(sim, 1 -p0)))
}
Then, the code runs, but I have not checked for correctness. So, no warranties (as usual on Stack Overflow).
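If you would rather not depend on stats internals at all, the same kind of simulation can be sketched with the exported r2dtable() function, which draws random tables with fixed row and column sums. The name SSAS_r2d is hypothetical, and I have not checked this against the paper's published figures, so treat it as a sketch rather than a drop-in replacement.

```r
# Sketch: SSAS using only exported functions from the stats package.
# SSAS_r2d is a hypothetical name; margins must be strictly positive.
SSAS_r2d <- function(x, conf.int = 0.95, B = 10000) {
  x <- as.matrix(x)
  sr <- rowSums(x)
  sc <- colSums(x)
  n <- sum(x)
  E <- outer(sr, sc, "*") / n            # expected counts under independence
  obs <- sum((x - E)^2 / E) / n          # observed statistic
  # r2dtable() returns a list of B random tables with margins sr and sc
  sim <- vapply(r2dtable(B, sr, sc),
                function(tab) sum((tab - E)^2 / E),
                numeric(1)) / n
  p0 <- (1 - conf.int) / 2
  c(obs, quantile(sim, p0), quantile(sim, 1 - p0))
}

# Small made-up table (2 sexes x 3 groups), only to show the call
set.seed(1)
x <- matrix(c(10, 2, 3, 12, 5, 6), nrow = 2)
res <- SSAS_r2d(x, B = 500)
print(res)
```

It is used exactly like the original, e.g. lapply(l1, SSAS_r2d). Because it loops over the simulated tables in R, it will be slower than the internal C routine.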

Related

equivalent of numpy.c_ in julia

Hi, I am going through the book https://nnfs.io/ but using JuliaLang (it's a self-challenge to get to know the language better and use it more often, rather than doing the same old thing in Python).
I have come across a part of the book where they wrote a custom function, and I need to recreate it in JuliaLang.
source: https://cs231n.github.io/neural-networks-case-study/
python
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
r = np.linspace(0.0,1,N) # radius
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j
# lets visualize the data:
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.show()
my julia version so far....
N = 100 # Number of points per class
D = 2 # Dimensionality
K = 3 # Number of classes
X = zeros((N*K, D))
y = zeros(UInt8, N*K)
# See https://docs.julialang.org/en/v1/base/math/#Base.range
for j in range(0,length=K)
ix = range(N*(j), length = N+1)
radius = LinRange(0.0, 1, N)
theta = LinRange(j*4, (j+1)*4, N) + randn(N)*0.2
X[ix] = ????????
end
Notice the ???????? part: I am trying to work out whether Julia has an equivalent of this numpy function:
https://numpy.org/doc/stable/reference/generated/numpy.c_.html
Any help is appreciated.. or just tell me if I need to write something myself
This is a special object that provides nice syntax for column concatenation. In Julia this is built into the language, so you can do:
julia> a=[1,2,3];
julia> b=[4,5,6];
julia> [a b]
3×2 Matrix{Int64}:
1 4
2 5
3 6
For your case the Julian equivalent of np.c_[r*np.sin(t), r*np.cos(t)] should be:
[r .* sin.(t) r .* cos.(t)]
To understand Python's motivation you can also have a look at :
numpy.r_ is not a function. What is it?
The equivalent of numpy.c_ would seem to be horizontal concatenation, which you can do with either the hcat function or with (e.g.) simply [a b]. Fixing a few other issues with the translation so far, we end up with
N = 100 # Number of points per class
D = 2 # Dimensionality
K = 3 # Number of classes
X = zeros(N*K, D)
y = zeros(UInt8, N*K)
for j in range(0,length=K)
ix = (N*j+1):(N*(j+1))
radius = LinRange(0.0, 1, N)
theta = LinRange(j*4, (j+1)*4, N) + randn(N)*0.2
X[ix,:] .= [radius.*sin.(theta) radius.*cos.(theta)]
y[ix] .= j
end
# visualize the data:
using Plots
scatter(X[:,1], X[:,2], zcolor=y, framestyle=:box)

R codes to extract ROC from Logistic regression model in 10 CV

I fitted a logistic regression model with 10-fold CV. I can use the pROC package to get the AUC, but it does not seem to be the AUC for the 10-fold CV, because the cvAUC library gave a different value. I suspect the AUC from pROC is for a single fold. How can I extract the joint AUC over the 10 folds using the pROC library?
data(iris)
data <- iris[which(iris$Species=="setosa" | iris$Species=="versicolor"),]
data$ID <- seq.int(nrow(data))
table(data$Species)
data$Species <-as.factor(data$Species)
confusion_matrices <- list()
accuracy <- c()
for (i in c(1:10)) {
set.seed(3456)
folds <- caret::createFolds(data$Species, k = 10)
test <- data[data$ID %in% folds[[i]], ]
train <- data[data$ID %in% unlist(folds[-i]), ]
model1 <- glm(as.factor(Species)~ ., family = binomial, data = train)
summary(model1)
pred <- predict(model1, newdata = test, type = "response")
predR <- as.factor( pred >= 0.5)
df <- data.frame(cbind(test$Species, predR))
df_list <- lapply(df, as.factor)
confusion_matrices[[i]] <- caret::confusionMatrix(df_list[[2]], df_list[[1]])
accuracy[[i]] <- confusion_matrices[[i]]$overall["Accuracy"]
}
library(pander)
library(dplyr)
names(accuracy) <- c("Fold 1",....,"Fold 10")
accuracy %>%
pander::pandoc.table()
mean(accuracy)
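One way to get a single cross-validated AUC is to keep the out-of-fold prediction for every row, pool them across the 10 folds, and compute one AUC over the pooled vector instead of one per fold. The sketch below uses simulated data (the variable names x1, x2, y and the helper auc_rank are hypothetical) and a base R rank formula for the AUC so that it runs without extra packages. If pROC is installed, passing the same pooled vectors to pROC::roc() and then pROC::auc() should give a comparable number, though I have not run that here.

```r
# Sketch: pooled out-of-fold AUC for a logistic regression (simulated data)
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # fold label for every row
oof <- numeric(n)                                # out-of-fold predictions

for (i in seq_len(k)) {
  test <- fold == i
  fit <- glm(y ~ x1 + x2, family = binomial, data = d[!test, ])
  oof[test] <- predict(fit, newdata = d[test, ], type = "response")
}

# AUC via the Mann-Whitney rank formula (ties get average ranks)
auc_rank <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

pooled_auc <- auc_rank(oof, d$y)
print(pooled_auc)
```

Note that a pooled AUC and the mean of per-fold AUCs are different quantities, which may explain why two packages report different values.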

apply sum along subsets of array 3rd dimension

I have the following objects:
A: 1 array with x,y,z, dimensions -> containing a variable (Temperature)
B & C: 2 arrays with x,y dimensions -> containing the indices of vectors along A's z dimension
A <- array(rnorm(n = 12*4*3*5), dim = c(4,3,5))
dimnames(A) <- list("x" = c(1:4), "y" = c(1:3), "z" = c(1:5))
B <- matrix(rep(c(2:1), 6), nrow = 4)
dimnames(B) <- list("x" = c(1:4), "y" = c(1:3))
C <- matrix(rep(c(4:5), 6), nrow = 4)
dimnames(C) <- list("x" = c(1:4), "y" = c(1:3))
I'm looking for a way to apply sum of A across the z dimension only between the indices indicated by B and C.
If instead of the 3D-array I had a vector I would solve it like this:
> A <- round(c(rnorm(5)), 1)
> B <- 2 #index of first value to sum
> C <- 4 #index of last value to sum
> vindex <- seq(B,C,1)
> A
[1] 0.0 -0.9 -1.1 -1.7 -0.4
> vindex
[1] 2 3 4
> sum(A[vindex])
[1] -3.7
>
# or better with a function
> foo <- function(x, start_idx, end_idx) {
+ vidx <- seq(start_idx, end_idx, 1)
+ return(sum(x[vidx]))
+ }
>
> foo(A,B,C)
[1] -3.7
Unfortunately, seq() does not accept vectors as its start and end arguments, so it is not straightforward to use the apply function. Going back to the case where A is indexed as A[x,y,z] and B and C as [x,y]:
> apply(A,c(1,2),foo,B,C)
Error in seq.default(start_idx, end_idx, 1) : 'from' must be of length 1
Called from: seq.default(start_idx, end_idx, 1)
It would be great if anybody knew how to make this function workable with apply or with other clean solutions.
Thanks a lot!
This is not a very nice task for base R, and I would prefer to implement it in C++ in the absence of a package that already does so (?).
Logically speaking, a plain but vectorized solution to your problem could be structured as:
# initialize index array
D <- array(
1,
dim = c(4,3,5),
dimnames = list(x = letters[1:4], y = letters[1:3], z = letters[1:5])
)
# set indices out of bounds to zero
E <- rep(1:5, each = 4*3)
BB <- rep(B, times = 5)
D[E < BB] <- 0
CC <- rep(C, times = 5)
D[E > CC] <- 0
# multiply with index array and sum
apply(A * D, c(1,2), sum)
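The same masking idea can be written more compactly with slice.index(), which returns the z index of every cell of A. This is a sketch using the question's A, B and C (without dimnames); it relies on R storing arrays with x varying fastest, so that B and C recycle correctly over the z slices.

```r
# Sketch: sum A over z, only between the per-cell indices B and C
set.seed(1)
A <- array(rnorm(4 * 3 * 5), dim = c(4, 3, 5))
B <- matrix(rep(2:1, 6), nrow = 4)   # first z index to include
C <- matrix(rep(4:5, 6), nrow = 4)   # last z index to include

z <- slice.index(A, 3)               # z index of every cell, same dim as A
# as.vector() lets the 4x3 matrices recycle across the 5 z slices
mask <- z >= as.vector(B) & z <= as.vector(C)
res <- rowSums(A * mask, dims = 2)   # sums over z, giving a 4x3 matrix

# Explicit loop for comparison
check <- matrix(0, 4, 3)
for (i in 1:4) for (j in 1:3) check[i, j] <- sum(A[i, j, B[i, j]:C[i, j]])
print(all.equal(res, check))
```

Multiplying by a logical mask treats excluded cells as zero, so this assumes A has no NA or infinite values outside the selected ranges.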

How to generate matrices directly into an array with a function?

I have a function that creates matrices. Later I have to do some time-consuming work with every single matrix of the set. So far, I'm bundling these matrices into a list with lapply(). Now, I assume operating on an array of matrices would be much faster. The thing is, I don't know how to have the matrices generated directly into an array, the way lapply() generates them into a list.
I give you this example:
# matrix generating function
mxSim <- function(X, n) {
mx = matrix(NA, nrow = n, ncol = 3,
dimnames = list(NULL, c("d", "alpha", "beta")))
mx[,1] = rbinom(n, 1, .375)
mx[,2] = rnorm(n, 0, 2)
mx[,3] = .42 * rnorm(n, 0, 6)
return(mx)
}
# bundle matrices together
mx.lst <- lapply(1:1e1, mxSim, n = 1e4)
# some stuff to be done after, like e. g.:
lapply(mx.lst, function(m) lm(d ~ alpha + beta, as.data.frame(m)))
Could anybody give me some advice on how to do this with an array?
I've been looking into this answer, but it requires the matrices to be generated already, so I would still have to put them into a list first.
Enough with the hooha. Let's time it.
library(microbenchmark)
# matrix generating function
mxSim <- function(X, n) {
mx = matrix(NA, nrow = n, ncol = 3,
dimnames = list(NULL, c("d", "alpha", "beta")))
mx[,1] = rbinom(n, 1, .375)
mx[,2] = rnorm(n, 0, 2)
mx[,3] = .42 * rnorm(n, 0, 6)
return(mx)
}
# bundle matrices together
mx.lst <- lapply(1:1e1, mxSim, n = 1e4)
mx.array <- array(mx.lst,dim=c(2,5))
# some stuff to be done after, like e. g.:
#Timing...
some.fnc<-function(m)lm(d ~ alpha + beta, as.data.frame(m))
list.test<-microbenchmark(lapply(mx.lst, some.fnc))
array.test<-microbenchmark(apply(mx.array, MARGIN=c(1,2), some.fnc))
expr min lq mean median uq max neval
lapply: 74.8953 101.9424 173.8733 146.7186 234.7577 397.2494 100
apply: 77.2362 101.0338 174.4178 137.153 264.6854 418.7297 100
Naively applying a function over a list as opposed to an array is almost identical in actual performance.
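Note that array(mx.lst, dim = c(2, 5)) is still a list, only with a dim attribute. If a genuine numeric 3-D array is wanted, one option is vapply() with a matrix as FUN.VALUE, which stacks the results along a third dimension. The sketch below uses a smaller n than the question so it runs quickly; mx.arr and fits are hypothetical names.

```r
# Sketch: generate the matrices straight into a numeric n x 3 x 10 array
mxSim <- function(X, n) {
  mx <- matrix(NA_real_, nrow = n, ncol = 3,
               dimnames = list(NULL, c("d", "alpha", "beta")))
  mx[, 1] <- rbinom(n, 1, .375)
  mx[, 2] <- rnorm(n, 0, 2)
  mx[, 3] <- .42 * rnorm(n, 0, 6)
  mx
}

set.seed(1)
n <- 100
# FUN.VALUE is a template: each call must return a double n x 3 matrix
mx.arr <- vapply(1:10, mxSim, FUN.VALUE = matrix(0, n, 3), n = n)
# Set explicitly so the sketch does not rely on how vapply propagates dimnames
dimnames(mx.arr) <- list(NULL, c("d", "alpha", "beta"), NULL)

# Fit one model per slice along the third dimension, keeping the coefficients
fits <- vapply(seq_len(dim(mx.arr)[3]),
               function(k) coef(lm(d ~ alpha + beta,
                                   data = as.data.frame(mx.arr[, , k]))),
               numeric(3))
print(dim(mx.arr))
print(dim(fits))
```

As the timings suggest, most of the cost is in lm() itself, so the container type is unlikely to matter much for speed.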
For the sake of completeness, I ran some more benchmarks with 1e3 matrices, as suggested in the comment on @SeldomSeenSlim's answer. In addition, I ran it with a list of data.frames, and the result was a bit surprising.
Here is the function for data.frames, for matrix function see above.
dfSim <- function(X, n) {
d <- rbinom(n, 1, .375)
alpha <- rnorm(n, 0, 2)
beta <- .42 * rnorm(n, 0, 6)
data.frame(d, alpha, beta)
}
Bundling
mx.lst <- lapply(1:1e3, mxSim, n = 1e4)
mx.array <- array(mx.lst, dim = c(2, 500))
df.lst <- lapply(1:1e3, dfSim, n = 1e4)
And the microbenchmarks:
some.fnc <- function(m) lm(d ~ alpha + beta, as.data.frame(m))
list.test <- microbenchmark(lapply(mx.lst, some.fnc))
array.test <- microbenchmark(apply(mx.array, MARGIN = c(1, 2), some.fnc))
df.list.test <- microbenchmark(lapply(df.lst, some.fnc))
Results
Unit: seconds
expr min lq mean median uq max neval
lapply 9.658568 9.742613 9.831577 9.784711 9.911466 10.30035 100
apply 9.727057 9.951213 9.994986 10.00614 10.06847 10.22178 100
lapply(df) 9.121293 9.229912 9.286592 9.277967 9.327829 10.12548 100
Now, what does this tell us?
But, okay, as a bold sidenote:
microbenchmark((lapply(1:1e3, mxSim, n = 1e4)), (lapply(1:1e3, dfSim, n = 1e4)))
expr min lq mean median uq max neval cld
(lapply(mxSim)) 2.533466 2.551199 2.563864 2.555421 2.559234 2.693383 100 a
(lapply(dfSim)) 2.676869 2.695826 2.718454 2.701161 2.706249 3.293431 100 b

Metropolis Hastings for linear regression model

I am trying to implement the Metropolis-Hastings algorithm for a simple linear regression in C (without use of other libraries (boost, Eigen etc.) and without two-dimensional arrays)*. For better testing of the code/evaluation of the trace plots, I have rewritten the code for R (see below) by keeping as much of the C-code as possible.
Unfortunately, the chains don't converge. I am wondering if
there is a mistake in the implementation itself?
"just" a bad choice of proposal distributions?
Assuming the latter, I am thinking about how to find good parameters of proposal distributions (currently I have picked arbitrary values) so that the algorithm works. Even with three parameters as in this case, it is quite hard to find suitable parameters. How does one normally handle this problem if say Gibbs sampling is not an alternative?
*I want to use this code for Cuda
#### posterior distribution
logPostDensity <- function(x, y, a, b, s2, N)
{
sumSqError = 0.0
for(i in 1:N)
{
sumSqError = sumSqError + (y[i] - (a + b*x[i]))^2
}
return(((-(N/2)+1) * log(s2)) + ((-0.5/s2) * sumSqError))
}
# x = x values
# y = actual datapoints
# N = sample size
# m = length of chain
# sigmaProp = uniform proposal for sigma squared
# paramAProp = uniform proposal for intercept
# paramBProp = uniform proposal for slope
mcmcSampling <- function(x,y,N,m,sigmaProp,paramAProp,paramBProp)
{
paramsA = vector("numeric",length=m) # intercept
paramsB = vector("numeric",length=m) # slope
s2 = vector("numeric",length=m) # sigma squared
paramsA[1] = 0
paramsB[1] = 0
s2[1] = 1
for(i in 2:m)
{
paramsA[i] = paramsA[i-1] + runif(1,-paramAProp,paramAProp)
if((logPostDensity(x,y,paramsA[i],paramsB[i],s2[i-1],N)
- logPostDensity(x,y,paramsA[i-1],paramsB[i-1],s2[i-1],N))
< log(runif(1)))
{
paramsA[i] = paramsA[i-1]
}
paramsB[i] = paramsB[i-1] + runif(1,-paramBProp,paramBProp)
if((logPostDensity(x,y,paramsA[i],paramsB[i],s2[i-1],N)
- logPostDensity(x,y,paramsA[i-1],paramsB[i-1],s2[i-1],N))
< log(runif(1)))
{
paramsB[i] = paramsB[i-1]
}
s2[i] = s2[i-1] + runif(1,-sigmaProp,sigmaProp)
if((s2[i] < 0) || (logPostDensity(x,y,paramsA[i],paramsB[i],s2[i],N)
- logPostDensity(x,y,paramsA[i],paramsB[i],s2[i-1],N))
< log(runif(1)))
{
s2[i] = s2[i-1]
}
}
res = data.frame(paramsA,paramsB,s2)
return(res)
}
#########################################
set.seed(321)
x <- runif(100)
y <- 2 + 5*x + rnorm(100)
summary(lm(y~x))
df <- mcmcSampling(x,y,10,5000,0.05,0.05,0.05)
par(mfrow=c(3,1))
plot(df$paramsA[3000:5000],type="l",main="intercept")
plot(df$paramsB[3000:5000],type="l",main="slope")
plot(df$s2[3000:5000],type="l",main="sigma")
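There was one mistake in the intercept section (paramsA): the acceptance test evaluated the posterior at paramsB[i], which has not been set yet at that point, instead of paramsB[i-1]. Everything else was fine. I've implemented what Alexey suggested in his comments. Here's the solution: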
pow <- function(x,y)
{
return(x^y)
}
#### posterior distribution
posteriorDistribution <- function(x, y, a, b,s2,N)
{
sumSqError <- 0.0
for(i in 1:N)
{
sumSqError <- sumSqError + pow(y[i] - (a + b*x[i]),2)
}
return((-((N/2)+1) * log(s2)) + ((-0.5/s2) * sumSqError))
}
# x <- x values
# y <- actual datapoints
# N <- sample size
# m <- length of chain
# sigmaProposalWidth <- width of uniform proposal dist for sigma squared
# paramAProposalWidth <- width of uniform proposal dist for intercept
# paramBProposalWidth <- width of uniform proposal dist for slope
mcmcSampling <- function(x,y,N,m,sigmaProposalWidth,paramAProposalWidth,paramBProposalWidth)
{
desiredAcc <- 0.44
paramsA <- vector("numeric",length=m) # intercept
paramsB <- vector("numeric",length=m) # slope
s2 <- vector("numeric",length=m) # sigma squared
paramsA[1] <- 0
paramsB[1] <- 0
s2[1] <- 1
accATot <- 0
accBTot <- 0
accS2Tot <- 0
for(i in 2:m)
{
paramsA[i] <- paramsA[i-1] + runif(1,-paramAProposalWidth,paramAProposalWidth)
accA <- 1
if((posteriorDistribution(x,y,paramsA[i],paramsB[i-1],s2[i-1],N) -
posteriorDistribution(x,y,paramsA[i-1],paramsB[i-1],s2[i-1],N)) < log(runif(1)))
{
paramsA[i] <- paramsA[i-1]
accA <- 0
}
accATot <- accATot + accA
paramsB[i] <- paramsB[i-1] + runif(1,-paramBProposalWidth,paramBProposalWidth)
accB <- 1
# Compare against the current state, which after the intercept step is (paramsA[i], paramsB[i-1])
if((posteriorDistribution(x,y,paramsA[i],paramsB[i],s2[i-1],N) -
posteriorDistribution(x,y,paramsA[i],paramsB[i-1],s2[i-1],N)) < log(runif(1)))
{
paramsB[i] <- paramsB[i-1]
accB <- 0
}
accBTot <- accBTot + accB
s2[i] <- s2[i-1] + runif(1,-sigmaProposalWidth,sigmaProposalWidth)
accS2 <- 1
if((s2[i] < 0) || (posteriorDistribution(x,y,paramsA[i],paramsB[i],s2[i],N) -
posteriorDistribution(x,y,paramsA[i],paramsB[i],s2[i-1],N)) < log(runif(1)))
{
s2[i] <- s2[i-1]
accS2 <- 0
}
accS2Tot <- accS2Tot + accS2
if(i%%100==0)
{
paramAProposalWidth <- paramAProposalWidth * ((accATot/100)/desiredAcc)
paramBProposalWidth <- paramBProposalWidth * ((accBTot/100)/desiredAcc)
sigmaProposalWidth <- sigmaProposalWidth * ((accS2Tot/100)/desiredAcc)
accATot <- 0
accBTot <- 0
accS2Tot <- 0
}
}
res <- data.frame(paramsA,paramsB,s2)
return(res)
}
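For testing, the same component-wise scheme can be written more compactly by keeping the current state in one vector and always comparing a proposal against that state. This is only a sketch: logPost, mhChain and the proposal widths are hypothetical, it has no adaptive tuning, and it assumes flat priors on the intercept and slope and a prior proportional to 1/s2, which matches the -((N/2)+1) * log(s2) term above.

```r
# Log posterior up to a constant; theta = c(a, b, s2)
logPost <- function(theta, x, y) {
  if (theta[3] <= 0) return(-Inf)          # s2 must be positive
  n <- length(y)
  sse <- sum((y - (theta[1] + theta[2] * x))^2)
  -(n / 2 + 1) * log(theta[3]) - 0.5 * sse / theta[3]
}

# Component-wise Metropolis with symmetric uniform proposals
mhChain <- function(x, y, m, width, init = c(0, 0, 1)) {
  out <- matrix(NA_real_, nrow = m, ncol = 3,
                dimnames = list(NULL, c("a", "b", "s2")))
  cur <- init
  curLp <- logPost(cur, x, y)
  out[1, ] <- cur
  for (i in 2:m) {
    for (k in 1:3) {
      prop <- cur
      prop[k] <- cur[k] + runif(1, -width[k], width[k])
      propLp <- logPost(prop, x, y)
      # Accept with probability min(1, posterior ratio)
      if (log(runif(1)) < propLp - curLp) {
        cur <- prop
        curLp <- propLp
      }
    }
    out[i, ] <- cur
  }
  out
}

set.seed(321)
x <- runif(100)
y <- 2 + 5 * x + rnorm(100)
chain <- mhChain(x, y, m = 5000, width = c(0.3, 0.5, 0.3))
post <- colMeans(chain[-(1:1000), ])       # drop burn-in (arbitrary length)
print(post)
print(coef(lm(y ~ x)))
```

With flat priors, the posterior means of a and b should land close to the lm() estimates, which gives a quick sanity check before porting to C. Note that this sketch uses all 100 data points via length(y) rather than passing N separately.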
