Is it possible to process a 4-dimensional netCDF file in R?

I am aware of the ncdf package for R.
I am trying to plot wind speed and wind direction (both 4D) from a WRF output (wrfout) netCDF file.
Each variable is a function of (i,j,k,l).
New variables need to be created by the calculation shown in the code.
Looping through four nested for loops takes too long. I understand that advanced looping techniques such as plyr, tapply or mapply exist, but the examples I have found only cover 2D arrays/matrices.
Please suggest a way (or a package) to optimize this for 4D arrays.
library(ncdf)
ncin <- open.ncdf("wrfout.nc")
imax <- 425 #No of grids in Longitude
jmax <- 200 #No of grids in Latitude
kmax <- 40 #Vertical layers
lmax <- 11 #Time
paiv <- atan(1.0)/45.0
#However, UUin and VVin are on a staggered grid in the input file
#dim(UUin) is (426,200,40,11)
#dim(VVin) is (425,201,40,11)
UUin <- get.var.ncdf(ncin,"U")
VVin <- get.var.ncdf(ncin,"V")
#Initialize arrays in normal grid
UU <- array(0.0, c(imax,jmax,kmax,lmax))
VV <- array(0.0, c(imax,jmax,kmax,lmax))
Wspd <- array(0.0,c(imax,jmax,kmax,lmax))
Wdir <- array(0.0,c(imax,jmax,kmax,lmax))
for (l in 1:lmax) {
for (k in 1:kmax) {
for (j in 1:jmax) {
for (i in 1:imax) {
#Change U and V from the staggered grid to the regular grid
UU[i,j,k,l] <- 0.5*(UUin[i,j,k,l]+ UUin[i+1,j,k,l])
VV[i,j,k,l] <- 0.5*(VVin[i,j,k,l]+ VVin[i,j+1,k,l])
#Now calculate wind speed and direction in regular grid
Wspd[i,j,k,l] <- sqrt(UU[i,j,k,l]*UU[i,j,k,l] + VV[i,j,k,l]* VV[i,j,k,l])
if (UU[i,j,k,l] == 0.0)
{ # startif
if(VV[i,j,k,l] < 0.0)
Wdir[i,j,k,l] = 0.0
else
Wdir[i,j,k,l] = 180.0
}
else
Wdir[i,j,k,l] = atan2(VV[i,j,k,l], UU[i,j,k,l]) / paiv
if(UU[i,j,k,l] < 0.0) {
Wdir[i,j,k,l] = 90.0 - Wdir[i,j,k,l]
}
else
Wdir[i,j,k,l] = 270.0 - Wdir[i,j,k,l]
if (Wdir[i,j,k,l] < 0.0)
{
Wdir[i,j,k,l] = Wdir[i,j,k,l] + 360.0
}
if (Wdir[i,j,k,l] > 360.0)
{
Wdir[i,j,k,l] = Wdir[i,j,k,l] - 360.0
}
} #end of i
} #end of j
} #end of k
} #end of l
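No extra package should be needed for the 4D case, because R arithmetic is already element-wise on whole arrays. Below is a minimal sketch of the destaggering and wind calculation without loops. The function name destagger_wind and the small synthetic arrays are hypothetical stand-ins for the output of get.var.ncdf, and the direction formula assumes the usual meteorological convention (direction the wind blows from, 0 = north, 270 = west), which may not match every branch of the loop above.

```r
# Hypothetical helper: destagger U and V and derive wind speed/direction
# using whole-array arithmetic instead of four nested loops.
destagger_wind <- function(UUin, VVin) {
  ni <- dim(VVin)[1]  # number of regular grid points in i (U has ni + 1)
  nj <- dim(UUin)[2]  # number of regular grid points in j (V has nj + 1)

  # Average neighbouring staggered points onto the regular grid
  UU <- 0.5 * (UUin[1:ni, , , , drop = FALSE] + UUin[2:(ni + 1), , , , drop = FALSE])
  VV <- 0.5 * (VVin[, 1:nj, , , drop = FALSE] + VVin[, 2:(nj + 1), , , drop = FALSE])

  Wspd <- sqrt(UU^2 + VV^2)
  # Assumed convention: direction the wind comes from, in degrees
  Wdir <- (270 - atan2(VV, UU) * 180 / pi) %% 360

  list(UU = UU, VV = VV, Wspd = Wspd, Wdir = Wdir)
}

# Small synthetic arrays standing in for the real WRF fields
imax <- 4; jmax <- 3; kmax <- 2; lmax <- 2
set.seed(1)
UUin <- array(rnorm((imax + 1) * jmax * kmax * lmax), c(imax + 1, jmax, kmax, lmax))
VVin <- array(rnorm(imax * (jmax + 1) * kmax * lmax), c(imax, jmax + 1, kmax, lmax))

wind <- destagger_wind(UUin, VVin)
dim(wind$Wspd)
```

With the real data, UUin and VVin would simply be the arrays read from the file; the slicing with drop = FALSE keeps all four dimensions even if one of them has length 1.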

Related

Kolmogorov-Smirnov test: C to R translation issue

I am having difficulty translating an algorithm from C to R. It concerns the Kolmogorov-Smirnov test, and more specifically the KS probability function.
In 'Numerical Recipes in C', 'probks', it's coded as
#include <math.h>
#define EPS1 0.001
#define EPS2 1.0e-8
float probks(float alam)
/*Kolmogorov-Smirnov probability function.*/
{
int j;
float a2,fac=2.0,sum=0.0,term,termbf=0.0;
a2 = -2.0*alam*alam;
for (j=1;j<=100;j++) {
term=fac*exp(a2*j*j);
sum += term;
if (fabs(term) <= EPS1*termbf || fabs(term) <= EPS2*sum) return sum;
fac = -fac; /*Alternating signs in sum.*/
termbf=fabs(term);
}
return 1.0; /* Get here only by failing to converge. */
}
I don't know how to translate the last few lines into R; all I have now is
PROBKS <- function(lambda) {
EPS1 <- 0.001; EPS2 <- 1.0e-8;
sum <- 0.0; fac <- 2.0; termbf <- 0.0;
a2 <- -2*lambda*lambda
for (j in 1:100) {
term <- fac * exp(a2*j*j)
sum <- sum + term
if ( (abs(term) <= EPS1*termbf) || (abs(term) <= EPS2*sum) ) {
break
} else {
fac <- -fac
}
}
termbf <- abs(term)
return(sum)
}
but this produces a non-monotonic probability function,
whereas it should satisfy $Q_{KS}(0) = 1$ and $Q_{KS}(\infty) = 0$.
Obviously, the problem is how to interpret/encode the last 'if' statement.
Any help would be very appreciated. M
EDIT 1:
Here my session info
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4.3 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.7
[5] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
[9] ggplot2_3.1.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] withr_2.1.2 rvest_0.3.2 tidyselect_0.2.5 lattice_0.20-35
[5] pkgconfig_2.0.2 xml2_1.2.0 compiler_3.4.4 readxl_1.1.0
[9] Rcpp_0.12.19 cli_1.0.1 plyr_1.8.4 cellranger_1.1.0
[13] httr_1.3.1 tools_3.4.4 nlme_3.1-131.1 broom_0.5.0
[17] R6_2.3.0 bindrcpp_0.2.2 bindr_0.1.1 scales_1.0.0
[21] assertthat_0.2.0 gtable_0.2.0 stringi_1.1.7 rstudioapi_0.8
[25] backports_1.1.2 hms_0.4.2 munsell_0.5.0 grid_3.4.4
[29] colorspace_1.3-2 glue_1.3.0 lubridate_1.7.4 rlang_0.3.0.1
[33] magrittr_1.5 lazyeval_0.2.1 yaml_2.2.0 crayon_1.3.4
[37] haven_1.1.2 modelr_0.1.2 pillar_1.3.0 jsonlite_1.5
EDIT 2
Using Konrad's function ks_cdf and
x = seq(0, 1, by = 0.01)
plot(x, ks_cdf(x))
still gives 0 at 0
EDIT 3
After upgrading to 3.6.1
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
...
I still get the same plot as above, i.e. ks_cdf(0)=0 while it should be ks_cdf(0)=1
The code can be translated into R almost literally; it’s not clear why you diverged from the C code. Here’s a literal, slightly cleaned up translation:
ks_cdf = function (lambda) {
EPS1 = 0.001
EPS2 = 1.0e-8
sum = 0
fac = 2
termbf = 0
a2 = -2 * lambda ^ 2
for (j in 1 : 100) {
term = fac * exp(a2 * j ^ 2)
sum = sum + term
if ((abs(term) <= EPS1 * termbf) || (abs(term) <= EPS2 * sum)) {
return(sum)
} else {
fac = -fac
termbf = abs(term)
}
}
1 # Failed to converge.
}
This code works but isn’t vectorised, which is something I’d change for a real implementation (but, by doing so, we’d lose the early exit).
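If the early exit matters more than raw speed, one option is to keep the scalar function and apply it element by element. The sketch below repeats the literal translation under the hypothetical name ks_q_scalar and wraps it with vapply; the names ks_q_scalar and ks_q are illustrative only.

```r
# Scalar KS probability function, same logic as the literal translation
ks_q_scalar <- function(lambda) {
  eps1 <- 0.001
  eps2 <- 1.0e-8
  total <- 0
  fac <- 2
  termbf <- 0
  a2 <- -2 * lambda^2
  for (j in 1:100) {
    term <- fac * exp(a2 * j^2)
    total <- total + term
    # Early exit once the terms have become negligible
    if (abs(term) <= eps1 * termbf || abs(term) <= eps2 * total) {
      return(total)
    }
    fac <- -fac            # alternating signs in the sum
    termbf <- abs(term)
  }
  1  # reached only when the series fails to converge
}

# Element-wise wrapper: accepts a vector, keeps the per-element early exit
ks_q <- function(lambda) vapply(lambda, ks_q_scalar, numeric(1))

x <- seq(0, 1, by = 0.01)
q <- ks_q(x)
```

This is still an R-level loop under the hood, so it will generally be slower than true matrix arithmetic for long inputs, but each element stops summing as soon as its own convergence test passes.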
Here’s an idiomatic R implementation using vectorised arithmetic and matrix multiplication:
ks_cdf = function (λ) {
eps1 = 0.001
eps2 = 1E-8
range = seq(1, 100)
terms = (-1) ^ (range - 1) * exp(-2 * range ^ 2 %*% t(λ ^ 2))
sums = 2 * colSums(terms)
pterms = abs(terms)
prev_pterms = rbind(0, pterms[-nrow(pterms), , drop = FALSE])
converged = apply(pterms <= eps1 * prev_pterms | pterms <= eps2 * sums, 2L, any)
sums[! converged] = 1
sums
}
And to show how nicely it vectorises, and that this is in fact a big deal:
x = seq(0, 1, by = 0.01)
plot(x, ks_cdf(x))

Metropolis Hastings for linear regression model

I am trying to implement the Metropolis-Hastings algorithm for a simple linear regression in C (without use of other libraries (boost, Eigen etc.) and without two-dimensional arrays)*. For better testing of the code/evaluation of the trace plots, I have rewritten the code for R (see below) by keeping as much of the C-code as possible.
Unfortunately, the chains don't converge. I am wondering whether there is a mistake in the implementation itself, or whether it is "just" a bad choice of proposal distributions.
Assuming the latter, I am thinking about how to find good parameters of proposal distributions (currently I have picked arbitrary values) so that the algorithm works. Even with three parameters as in this case, it is quite hard to find suitable parameters. How does one normally handle this problem if say Gibbs sampling is not an alternative?
*I want to use this code for Cuda
#### posterior distribution
logPostDensity <- function(x, y, a, b, s2, N)
{
sumSqError = 0.0
for(i in 1:N)
{
sumSqError = sumSqError + (y[i] - (a + b*x[i]))^2
}
return(((-(N/2)+1) * log(s2)) + ((-0.5/s2) * sumSqError))
}
# x = x values
# y = actual datapoints
# N = sample size
# m = length of chain
# sigmaProp = uniform proposal for sigma squared
# paramAProp = uniform proposal for intercept
# paramBProp = uniform proposal for slope
mcmcSampling <- function(x,y,N,m,sigmaProp,paramAProp,paramBProp)
{
paramsA = vector("numeric",length=m) # intercept
paramsB = vector("numeric",length=m) # slope
s2 = vector("numeric",length=m) # sigma squared
paramsA[1] = 0
paramsB[1] = 0
s2[1] = 1
for(i in 2:m)
{
paramsA[i] = paramsA[i-1] + runif(1,-paramAProp,paramAProp)
if((logPostDensity(x,y,paramsA[i],paramsB[i],s2[i-1],N)
- logPostDensity(x,y,paramsA[i-1],paramsB[i-1],s2[i-1],N))
< log(runif(1)))
{
paramsA[i] = paramsA[i-1]
}
paramsB[i] = paramsB[i-1] + runif(1,-paramBProp,paramBProp)
if((logPostDensity(x,y,paramsA[i],paramsB[i],s2[i-1],N)
- logPostDensity(x,y,paramsA[i-1],paramsB[i-1],s2[i-1],N))
< log(runif(1)))
{
paramsB[i] = paramsB[i-1]
}
s2[i] = s2[i-1] + runif(1,-sigmaProp,sigmaProp)
if((s2[i] < 0) || (logPostDensity(x,y,paramsA[i],paramsB[i],s2[i],N)
- logPostDensity(x,y,paramsA[i],paramsB[i],s2[i-1],N))
< log(runif(1)))
{
s2[i] = s2[i-1]
}
}
res = data.frame(paramsA,paramsB,s2)
return(res)
}
#########################################
set.seed(321)
x <- runif(100)
y <- 2 + 5*x + rnorm(100)
summary(lm(y~x))
df <- mcmcSampling(x,y,10,5000,0.05,0.05,0.05)
par(mfrow=c(3,1))
plot(df$paramsA[3000:5000],type="l",main="intercept")
plot(df$paramsB[3000:5000],type="l",main="slope")
plot(df$s2[3000:5000],type="l",main="sigma")
There was one mistake in the intercept section (paramsA). Everything else was fine. I've implemented what Alexey suggested in his comments. Here's the solution:
pow <- function(x,y)
{
return(x^y)
}
#### posterior distribution
posteriorDistribution <- function(x, y, a, b,s2,N)
{
sumSqError <- 0.0
for(i in 1:N)
{
sumSqError <- sumSqError + pow(y[i] - (a + b*x[i]),2)
}
return((-((N/2)+1) * log(s2)) + ((-0.5/s2) * sumSqError))
}
# x <- x values
# y <- actual datapoints
# N <- sample size
# m <- length of chain
# sigmaProposalWidth <- width of uniform proposal dist for sigma squared
# paramAProposalWidth <- width of uniform proposal dist for intercept
# paramBProposalWidth <- width of uniform proposal dist for slope
mcmcSampling <- function(x,y,N,m,sigmaProposalWidth,paramAProposalWidth,paramBProposalWidth)
{
desiredAcc <- 0.44
paramsA <- vector("numeric",length=m) # intercept
paramsB <- vector("numeric",length=m) # slope
s2 <- vector("numeric",length=m) # sigma squared
paramsA[1] <- 0
paramsB[1] <- 0
s2[1] <- 1
accATot <- 0
accBTot <- 0
accS2Tot <- 0
for(i in 2:m)
{
paramsA[i] <- paramsA[i-1] + runif(1,-paramAProposalWidth,paramAProposalWidth)
accA <- 1
if((posteriorDistribution(x,y,paramsA[i],paramsB[i-1],s2[i-1],N) -
posteriorDistribution(x,y,paramsA[i-1],paramsB[i-1],s2[i-1],N)) < log(runif(1)))
{
paramsA[i] <- paramsA[i-1]
accA <- 0
}
accATot <- accATot + accA
paramsB[i] <- paramsB[i-1] + runif(1,-paramBProposalWidth,paramBProposalWidth)
accB <- 1
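# Compare against the current state (paramsA[i], paramsB[i-1]): the intercept was already updated above
if((posteriorDistribution(x,y,paramsA[i],paramsB[i],s2[i-1],N) -
posteriorDistribution(x,y,paramsA[i],paramsB[i-1],s2[i-1],N)) < log(runif(1)))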
{
paramsB[i] <- paramsB[i-1]
accB <- 0
}
accBTot <- accBTot + accB
s2[i] <- s2[i-1] + runif(1,-sigmaProposalWidth,sigmaProposalWidth)
accS2 <- 1
if((s2[i] < 0) || (posteriorDistribution(x,y,paramsA[i],paramsB[i],s2[i],N) -
posteriorDistribution(x,y,paramsA[i],paramsB[i],s2[i-1],N)) < log(runif(1)))
{
s2[i] <- s2[i-1]
accS2 <- 0
}
accS2Tot <- accS2Tot + accS2
if(i%%100==0)
{
paramAProposalWidth <- paramAProposalWidth * ((accATot/100)/desiredAcc)
paramBProposalWidth <- paramBProposalWidth * ((accBTot/100)/desiredAcc)
sigmaProposalWidth <- sigmaProposalWidth * ((accS2Tot/100)/desiredAcc)
accATot <- 0
accBTot <- 0
accS2Tot <- 0
}
}
res <- data.frame(paramsA,paramsB,s2)
return(res)
}
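The explicit loop in posteriorDistribution mirrors the C code, which is useful if the goal is a CUDA port. For testing in R only, the same log posterior can be written as a vectorised sum, which is usually much faster over thousands of iterations. This is a minimal sketch; the names log_post_loop and log_post_vec are hypothetical, and the formula is copied from the function above.

```r
# Loop version, same formula as posteriorDistribution above
log_post_loop <- function(x, y, a, b, s2, N) {
  sumSqError <- 0.0
  for (i in 1:N) {
    sumSqError <- sumSqError + (y[i] - (a + b * x[i]))^2
  }
  (-((N / 2) + 1) * log(s2)) + ((-0.5 / s2) * sumSqError)
}

# Vectorised version: residuals for the first N points in one expression
log_post_vec <- function(x, y, a, b, s2, N) {
  idx <- seq_len(N)
  resid <- y[idx] - (a + b * x[idx])
  (-((N / 2) + 1) * log(s2)) + ((-0.5 / s2) * sum(resid^2))
}

set.seed(321)
x <- runif(100)
y <- 2 + 5 * x + rnorm(100)
all.equal(log_post_loop(x, y, 2, 5, 1, 100), log_post_vec(x, y, 2, 5, 1, 100))
```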

R - avoid nested for loops

I have the following function, which takes 4 vectors. The T vector has a given length, and the 3 other vectors (pga, Sa5Hz and Sa1Hz) share a common length (not necessarily equal to that of T).
The output is a matrix with length(T) rows and length(pga) columns.
My code below seems like the perfect example of what NOT to do, however, I could not figure out a way to optimize it using an apply function. Can anyone help?
designSpectrum <- function (T, pga, Sa5Hz, Sa1Hz){
Ts <- Sa1Hz / Sa5Hz
#By convention, if Sa5Hz is null, set Ts as 0.
Ts[is.nan(Ts)] <- 0
res <- matrix(NA, nrow = length(T), ncol = length(pga))
for (i in 1:nrow(res))
{
for (j in 1:ncol(res))
{
res[i,j] <- if(T[i] <= 0) {pga[j]}
else if (T[i] <= 0.2 * Ts[j]) {pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j])}
else if (T[i] <= Ts[j]) {Sa5Hz[j]}
else Sa1Hz[j] / T[i]
}
}
return(res)
}
Instead of doing a double for loop and processing each i and j value separately, you could use the outer function to process all of them in one shot. Since you're now processing multiple i and j values simultaneously, you could switch to the vectorized ifelse statement instead of the non-vectorized if and else statements:
designSpectrum2 <- function (T, pga, Sa5Hz, Sa1Hz) {
Ts <- Sa1Hz / Sa5Hz
Ts[is.nan(Ts)] <- 0
outer(1:length(T), 1:length(pga), function(i, j) {
ifelse(T[i] <= 0, pga[j],
ifelse(T[i] <= 0.2 * Ts[j], pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j]),
ifelse(T[i] <= Ts[j], Sa5Hz[j], Sa1Hz[j] / T[i])))
})
}
identical(designSpectrum(T, pga, Sa5Hz, Sa1Hz), designSpectrum2(T, pga, Sa5Hz, Sa1Hz))
# [1] TRUE
Data:
T <- -1:3
pga <- 1:3
Sa5Hz <- 2:4
Sa1Hz <- 3:5
You can see the efficiency gains by testing on rather large vectors (here I'll use an output matrix with 1 million entries):
# Larger vectors
set.seed(144)
T2 <- runif(1000, -1, 3)
pga2 <- runif(1000, -1, 3)
Sa5Hz2 <- runif(1000, -1, 3)
Sa1Hz2 <- runif(1000, -1, 3)
# Runtime comparison
all.equal(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2), designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# [1] TRUE
system.time(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 4.038 1.011 5.042
system.time(designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 0.517 0.138 0.652
The approach with outer is roughly 8x faster (0.652 s versus 5.042 s elapsed).
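Another possible variant avoids nested ifelse calls by expanding the inputs to full matrices and overwriting the result with logical masks, from the lowest-priority branch to the highest. This is only a sketch under the same inputs as above; designSpectrum3 and the helper expand are hypothetical names, and it has not been timed here.

```r
designSpectrum3 <- function(T, pga, Sa5Hz, Sa1Hz) {
  Ts <- Sa1Hz / Sa5Hz
  Ts[is.nan(Ts)] <- 0
  nT <- length(T)
  nP <- length(pga)

  # Expand everything to nT x nP matrices (T varies by row, the rest by column)
  Tm <- matrix(T, nT, nP)
  expand <- function(v) matrix(v, nT, nP, byrow = TRUE)
  pgaM <- expand(pga); S5 <- expand(Sa5Hz); S1 <- expand(Sa1Hz); TsM <- expand(Ts)

  # Start with the final 'else' branch, then overwrite in reverse priority order
  res <- S1 / Tm
  idx <- Tm <= TsM
  res[idx] <- S5[idx]
  idx <- Tm <= 0.2 * TsM
  res[idx] <- (pgaM + Tm * (S5 - pgaM) / (0.2 * TsM))[idx]
  idx <- Tm <= 0
  res[idx] <- pgaM[idx]
  res
}

T <- -1:3
pga <- 1:3
Sa5Hz <- 2:4
Sa1Hz <- 3:5
designSpectrum3(T, pga, Sa5Hz, Sa1Hz)
```

Because later assignments overwrite earlier ones, each cell ends up with the value of the first condition that would have matched in the original if/else chain. The masks assume the inputs contain no NA values.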

Compute the area on graph of time series

I want to compute some specific areas inside the graph of a time series.
This is the graph (image not shown):
and this is the code I use to generate my time series:
x1<-rnorm(250,0.4,0.9)
x <- as.matrix(x1)
t <- ts(x[,1], start=c(1,1), frequency=30)
plot(t,xlim=c(2,4),main="Daily closing price of Walterenergie",ylab="Adjusted close Returns",xlab="Times",col="blue")
abline(0,0)
So my question is how to compute the area in green and, separately, the area in red.
If you need the code for the graph:
###### Graph of the performance of my portfolio
num.points <- 250
x.vals <- 1:num.points
values <- x1
plot(x.vals, values, t="l", pch=20, main="Daily closing price", xlab="Times", ylab="Adjusted close Returns", las=1)
abline(h=0, col="darkgray", lwd=2)
crossings <- values[-length(values)] * values[-1]
crossings <- which(crossings < 0)
# You can draw the points to check (uncomment following line)
# points(x.vals[crossings], values[crossings], col="red", pch="X")
intersections <- NULL
for (cr in crossings)
{
new.int <- cr + abs(values[cr])/(abs(values[cr])+abs(values[cr+1]))
intersections <- c(intersections, new.int)
}
# Again, let's check the intersections
#points(intersections, rep(0, length(intersections)), pch=20, col="red", cex=0.7)
last.intersection <- 0
for (i in intersections)
{
ids <- which(x.vals<=i & x.vals>last.intersection)
poly.x <- c(last.intersection, x.vals[ids], i)
poly.y <- c(0, values[ids], 0)
if (max(poly.y) > 0)
{
col="green"
}
else
{
col="red"
}
polygon(x=poly.x, y=poly.y, col=col)
last.intersection <- i
}
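Since the series is drawn as straight segments between points, the green (above zero) and red (below zero) areas can be computed segment by segment: a trapezoid when both endpoints are on the same side of zero, and two triangles when the segment crosses zero. The following is a minimal sketch; the function name area_above_below is hypothetical, and it assumes unit spacing between points unless x is supplied.

```r
# Area between a piecewise-linear series and the zero line,
# split into the part above zero and the part below zero.
area_above_below <- function(y, x = seq_along(y)) {
  y0 <- y[-length(y)]   # left endpoint of each segment
  y1 <- y[-1]           # right endpoint of each segment
  dx <- diff(x)
  crosses <- y0 * y1 < 0

  # Segments entirely on one side of zero: plain trapezoid
  trap <- dx * (y0 + y1) / 2
  pos <- ifelse(crosses, 0, pmax(trap, 0))
  neg <- ifelse(crosses, 0, pmax(-trap, 0))

  # Crossing segments: one triangle above zero, one below
  span <- abs(y0) + abs(y1)
  hi <- pmax(y0, y1)
  lo <- pmin(y0, y1)
  pos <- pos + ifelse(crosses, dx * hi^2 / (2 * span), 0)
  neg <- neg + ifelse(crosses, dx * lo^2 / (2 * span), 0)

  c(positive = sum(pos), negative = sum(neg))
}

set.seed(42)
x1 <- rnorm(250, 0.4, 0.9)
areas <- area_above_below(x1)
areas
```

The triangle formula uses the same linear interpolation of the zero crossing as the intersections computed in the plotting code, so the totals correspond to the filled polygons. The negative area is returned as a positive magnitude.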

Loop in C to make RScript more efficient performance

I am trying to compute the number of pairwise differences between each pair of rows in a table of 100 rows x 2500 columns.
I have a small R script that does this, but the run time is (obviously) extremely high!
I am trying to write the loop in C, but I keep getting errors (compileCode).
Do you have any idea how I can "convert" the following loop to C?
pw.dist <- function (vec1, vec2) {
return( length(which(vec1!=vec2)) )
}
N.row <- dim(table)[1]
pw.dist.table <- array( dim = c(dim(table)[1], dim(table)[1]))
for (i in 1:N.row) {
for (j in 1:N.row) {
pw.dist.table[i,j] <- pw.dist(table[i,-c(1)], table[j,-c(1)])
}
}
I am trying something like:
sig <- signature(N.row="integer", table="integer", pw.dist.table="integer")
code <- "
for( int i = 0; i < (*N.row) - 1; i++ ) {
for( int j = i + 1; j < *N.row; j++ ) {
int pw.dist.table = table[j] - table[i];
}
}
"
f <- cfunction( sig, code, convention=".C" )
I am a complete newbie when it comes to programming!
Thanks in advance.
JMFA
Before trying to optimize the code,
it is always a good idea to check where the time is spent.
Rprof()
... # Your loops
Rprof(NULL)
summaryRprof()
In your case, the loop is not slow, but your distance function is.
$by.total
total.time total.pct self.time self.pct
"pw.dist" 37.98 98.85 0.54 1.41
"which" 37.44 97.45 34.02 88.55
"!=" 3.12 8.12 3.12 8.12
You can rewrite it as follows (it takes 1 second).
# Sample data
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nr=n, nc=k)
# Function to compute the number of differences
f <- function(i,j) sum(d[i,]!=d[j,])
# You could use a loop, instead of outer,
# it should not make a big difference.
d2 <- outer( 1:n, 1:n, Vectorize(f) )
Vincent above has the right idea. In addition, you can take advantage of how matrices work in R and get even faster results:
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nr=n, nc=k)
system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
#precompute transpose of matrix - you can just replace
#dt with t(d) if you want to avoid this
system.time(dt <- t(d))
system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
Output:
#> system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
# user system elapsed
# 0.4 0.0 0.4
#> system.time(dt <- t(d))
# user system elapsed
# 0 0 0
#> system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
# user system elapsed
# 0.08 0.00 0.08
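As a sanity check, the two approaches can be compared on a small matrix to confirm they produce the same distance table. This is a sketch with made-up sizes; the variable names by_outer and by_cols are illustrative.

```r
set.seed(1)
n <- 5
k <- 20
d <- matrix(sample(1:3, n * k, replace = TRUE), nrow = n, ncol = k)

# Pairwise differences via outer + Vectorize
f <- function(i, j) sum(d[i, ] != d[j, ])
by_outer <- outer(1:n, 1:n, Vectorize(f))

# Pairwise differences via column comparison on the transposed matrix:
# dt[, i] is recycled down every column of dt, so each comparison is row i vs all rows
dt <- t(d)
by_cols <- sapply(1:n, function(i) colSums(dt[, i] != dt))

all.equal(by_outer, by_cols)
```

Both results are symmetric with a zero diagonal, so the orientation of the sapply output does not matter here.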
