ggplot using stat_function in a loop

I have been struggling with the following issue. I want to create a graph using ggplot, superimposing some observed data points and estimated functions in one and the same graph. The problem is that I want to plot many functions in the same graph, with parameters stored in a data frame pars. I would thus need a loop both to define the functions and to add them to the plot.
I have been thinking along the following lines, but all attempts have failed so far:
d <- data.frame(xvar = c(0.1, 0.12, 0.5, 0.88, 0.9),
                yvar = c(0.2, 0.24, 0.6, 0.76, 0.8))
pars <- data.frame(a = c(0.77, 0.56, 0.48, 0.98, 0.67, 0.87),
                   b = c(20.3, 15.6, 38, 27.7, 19.8, 23.4),
                   row = c(1, 2, 3, 4, 5, 6))
for (i in 1:n) {
  fun[i] <- function(x) (x^pars$a[row == i] + 1 / (x * pars$b[row == i]))
}
d %>%
  ggplot() +
  geom_point(aes(x = xvar, y = yvar)) +
  for (i in 1:n) {
    stat_function(fun = fun[i])
  }
Any ideas how this could be done?

As a general rule, if you want to add layers using a loop, use lapply or one of the purrr::map family of functions.
In your case I would go for purrr::pmap, which allows you to loop over the rows of your parameter data frame:
library(ggplot2)

ggplot(d) +
  geom_point(aes(x = xvar, y = yvar)) +
  purrr::pmap(pars, function(a, b, row) {
    stat_function(aes(color = factor(row)), fun = function(x) x^a + 1 / (x * b))
  })
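The same pattern also works with a plain lapply over row indices, since ggplot2 accepts a list of layers added with +. A minimal sketch, assuming the d and pars from the question (this variant is not from the original answer):

# One stat_function layer per parameter row; ggplot() + <list of layers> works.
ggplot(d) +
  geom_point(aes(x = xvar, y = yvar)) +
  lapply(seq_len(nrow(pars)), function(i) {
    stat_function(aes(color = factor(pars$row[i])),
                  fun = function(x) x^pars$a[i] + 1 / (x * pars$b[i]))
  })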

Related

Split, group and mean: computation with arrays

A is a given N x R x T array. I must split it along the first dimension into N sub-arrays of size R x T, and then group every z of them together into an array K_i and take a mean.
For example: A is the array rand(N,R,T) = rand(16, 3, 3). Now I am going to split it:
A = rand(16, 3, 3): A(1,:,:), A(2,:,:), A(3,:,:), A(4,:,:), ..., A(16,:,:).
I have 16 slices.
B_1 = A(1,:,:); B_2 = A(2,:,:); B_3 = A(3,:,:); ...; B_16 = A(16,:,:);
The next step is grouping together every 3 (for example).
Now I am going to create K_i as:
K_1(1,:,:)=B_1;
K_1(2,:,:)=B_2;
K_1(3,:,:)=B_3;
...
K_8(1,:,:)=B_14;
K_8(2,:,:)=B_15;
K_8(3,:,:)=B_16;
The average array is found as:
C_1=[B_1 + B_2 + B_3]/3
...
C_8= [ B_14 + B_15 + B_16] /3
I have implemented it as:
A_reshape = reshape(squeeze(A), size(A,2), size(A,3),2, []);
mean_of_all_slices = permute(mean(A_reshape , 3), [1 2 4 3]);
Question 1: I have checked by hand and it gives me a wrong result. How can I fix it? [SOLVED]
EDIT 2: I need to simulate the following computation:
take the product of each slice of the array K_i with another array P_p. That means:
for `K_1`, `P_1` is given: `B_1 * P_1`, `B_2 * P_1`, `B_3 * P_1`
...
for `K_8`, `P_8` is given: `B_14 * P_8`, `B_15 * P_8`, `B_16 * P_8`
I have solved it!
Disclaimer: this answers a previous version of the question.
In cases such as this I would suggest relying on built-ins, which have predictable behavior. In your case, this would be movmean (introduced in R2016a):
WIN_SZ = 2;  % Window size for averaging
AVG_DIM = 1; % Dimension for averaging
tmp = movmean(A, WIN_SZ, AVG_DIM, 'Endpoints', 'discard');
C = tmp(1:WIN_SZ:end, :, :); % This only selects A1+A2, A3+A4 etc.
If your MATLAB is a bit older, this can also be done using convolution (convn, introduced before R2006):
WIN_SZ = 3;
tmp = convn(A, ones(WIN_SZ, 1) ./ WIN_SZ, 'valid'); % Shorter than A in dim1 by (WIN_SZ-1)
C = tmp(1:WIN_SZ:end, :, :); % dim1 size is: ceil((size(A,1)-(WIN_SZ-1))/WIN_SZ)
BTW, the step where you create B from slices of A can be done using
B = num2cell(A,[2,3]); % yields a 16x1 cell array of 1x3x3 double arrays

Read multidimensional NetCDF as data frame in R

I use a netCDF file which stores one variable and has the following dimensions: lon, lat, time.
Generally speaking, I wish to compare it against different data that I already have in R, stored as a data frame: the first two columns are coordinates in WGS84, and the following columns are values for specific times.
So I wrote the following code.
# ncFile$dim$time$units says: [1] "days since 1900-1-1"
daysFromDate <- function(data1, data2 = "1900-01-01") {
  round(as.numeric(difftime(data1, data2, units = "days")))
}
#study area:
lon <- c(40.25, 48)
lat <- c(16, 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
varName <- "spei"
require(ncdf4)
require(RCurl)
x <- getBinaryURL("http://digital.csic.es/bitstream/10261/104742/3/SPEI_01.nc")
ncFile <- nc_open(x)
LonIdx <- which( ncFile$dim$lon$vals >= lon[1] | ncFile$dim$lon$vals <= lon[2])
LatIdx <- which( ncFile$dim$lat$vals >= lat[1] & ncFile$dim$lat$vals <= lat[2])
TimeIdx <- which( ncFile$dim$time$vals >= myTime[1] & ncFile$dim$time$vals <= myTime[2])
MyVariable <- ncvar_get( ncFile, varName)[ LonIdx, LatIdx, TimeIdx]
I thought that a data frame would be returned, so that I could easily manipulate the data (for example, check correlations or create a plot).
Unfortunately, a 3-dimensional array was returned instead.
How can I reformat this into a data frame with the columns X, Y, Time1, Time2, ...?
So, the example data would look as follows:
X   Y   2014-01-01  2014-01-02  2014-01-03
50  17         0.5         0.4         0.3
where 0.5, 0.4 and 0.3 are example variable values
Or maybe there is a different solution?
OK, try the following code, but note that it assumes the ranges are densely filled. Also, I changed the lon test from or (|) to and (&):
require(ncdf4)
nc <- nc_open("SPEI_01.nc")
print(nc)
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
time <- ncvar_get(nc, "time")
lonIdx <- which( lon >= 40.25 & lon <= 48.00)
latIdx <- which( lat >= 16.00 & lat <= 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
timeIdx <- which(time >= myTime[1] & time <= myTime[2])
data <- ncvar_get(nc, "spei")[lonIdx, latIdx, timeIdx]
indices <- expand.grid(lon[lonIdx], lat[latIdx], time[timeIdx])
print(length(indices))
class(indices)
summary(indices)
str(indices)
df <- data.frame(cbind(indices, as.vector(data)))
summary(df)
str(df)
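The cbind lines up correctly because expand.grid varies its first column fastest, which matches the column-major order that as.vector uses to flatten the array. A tiny check with made-up values (not from the original answer):

m <- array(1:8, dim = c(2, 2, 2))
g <- expand.grid(x = 1:2, y = 1:2, z = 1:2)
cbind(g, value = as.vector(m))  # value 1..8 pairs with the correct (x, y, z)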
UPDATE
OK, it looks like I got the idea of what you want, but I don't have a direct solution. What I've got so far is this: split the data frame using either the split() function or the data.table package. After splitting by X and Y, you'll get a list of small data frames where X and Y are constant for a given frame. It is probably possible to transpose and recombine them back, but I have no idea how. It might be a good idea to keep working with the data as columns. Lists are nested but can be flattened, and here is a link on splitting in R: http://www.uni-kiel.de/psychologie/rexrepos/posts/dfSplitMerge.html
Code, continued from the previous example:
require(data.table)
colnames(df) <- c("X","Y","Time","spei")
df$Time <- as.Date(df$Time, origin="1900-01-01")
dt <- as.data.table(df)
summary(dt)
# Taken from https://github.com/Rdatatable/data.table/issues/1389
# x       data.table
# f       use `by` argument instead - unlike data.frame
# drop    logical, default FALSE, will include `by` columns in resulting data.tables - unlike data.frame
# by      character column names on which to split into lists
# flatten logical, default FALSE, will result in a recursive nested list having data.tables as leaves
# ...     ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...) {
  if (missing(by) && !missing(f)) by = f
  stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten),
            !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
  if (!flatten) {
    .by = by[1L]
    tmp = x[, list(.ll = list(.SD)), by = .by, .SDcols = if (drop) setdiff(names(x), .by) else names(x)]
    setattr(ll <- tmp$.ll, "names", tmp[[.by]])
    if (length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
  } else {
    tmp = x[, list(.ll = list(.SD)), by = by, .SDcols = if (drop) setdiff(names(x), by) else names(x)]
    setattr(ll <- tmp$.ll, "names", tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
    return(ll)
  }
}
# here is data.table split
q <- split.data.table(dt, by = c("X","Y"), drop=FALSE)
str(q)
# here is data frame split
qq <- split(df, list(df$X, df$Y))
str(qq)
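For the wide X-Y-Time1-Time2-... layout the question asks for, reshaping the long data frame is another route. A minimal sketch using data.table::dcast, assuming the df with columns X, Y, Time, spei built above (this step is not from the original answer):

require(data.table)
dt <- as.data.table(df)
# One row per (X, Y) coordinate, one column per Time value.
wide <- dcast(dt, X + Y ~ Time, value.var = "spei")
str(wide)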

R - avoid nested for loops

I have the following function which takes 4 vectors. The T vector has a given length, and the 3 other vectors (pga, Sa5Hz and Sa1Hz) have a given length (identical to each other, but not necessarily equal to that of T).
The output is a matrix with length(T) rows and length(pga) columns.
My code below seems like the perfect example of what NOT to do; however, I could not figure out a way to optimize it using an apply function. Can anyone help?
designSpectrum <- function(T, pga, Sa5Hz, Sa1Hz) {
  Ts <- Sa1Hz / Sa5Hz
  # By convention, if Sa5Hz is zero, set Ts to 0.
  Ts[is.nan(Ts)] <- 0
  res <- matrix(NA, nrow = length(T), ncol = length(pga))
  for (i in 1:nrow(res)) {
    for (j in 1:ncol(res)) {
      res[i, j] <- if (T[i] <= 0) {
        pga[j]
      } else if (T[i] <= 0.2 * Ts[j]) {
        pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j])
      } else if (T[i] <= Ts[j]) {
        Sa5Hz[j]
      } else {
        Sa1Hz[j] / T[i]
      }
    }
  }
  return(res)
}
Instead of doing a double for loop and processing each i and j value separately, you could use the outer function to process all of them in one shot. Since you're now processing multiple i and j values simultaneously, you could switch to the vectorized ifelse statement instead of the non-vectorized if and else statements:
designSpectrum2 <- function(T, pga, Sa5Hz, Sa1Hz) {
  Ts <- Sa1Hz / Sa5Hz
  Ts[is.nan(Ts)] <- 0
  outer(1:length(T), 1:length(pga), function(i, j) {
    ifelse(T[i] <= 0, pga[j],
           ifelse(T[i] <= 0.2 * Ts[j], pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j]),
                  ifelse(T[i] <= Ts[j], Sa5Hz[j], Sa1Hz[j] / T[i])))
  })
}
identical(designSpectrum(T, pga, Sa5Hz, Sa1Hz), designSpectrum2(T, pga, Sa5Hz, Sa1Hz))
# [1] TRUE
Data:
T <- -1:3
pga <- 1:3
Sa5Hz <- 2:4
Sa1Hz <- 3:5
You can see the efficiency gains by testing on rather large vectors (here I'll use an output matrix with 1 million entries):
# Larger vectors
set.seed(144)
T2 <- runif(1000, -1, 3)
pga2 <- runif(1000, -1, 3)
Sa5Hz2 <- runif(1000, -1, 3)
Sa1Hz2 <- runif(1000, -1, 3)
# Runtime comparison
all.equal(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2), designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# [1] TRUE
system.time(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 4.038 1.011 5.042
system.time(designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 0.517 0.138 0.652
The approach with outer is roughly 8x faster here (5.04 vs. 0.65 seconds elapsed).
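As a side note on why the switch to ifelse is needed: outer expands the two index vectors into the full grid of (i, j) pairs and calls the function once on whole vectors, so the body must be vectorized. A tiny illustration (not from the original answer):

# Every i is paired with every j in a single vectorized call.
outer(1:3, 1:2, function(i, j) 10 * i + j)
#      [,1] [,2]
# [1,]   11   12
# [2,]   21   22
# [3,]   31   32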

Loop in C to make R script more efficient

I am trying to compute the number of pairwise differences between each pair of rows in a table of 100 rows x 2500 columns.
I have a small R script that does this, but the run time is (obviously) extremely high!
I am trying to write the loop in C but I keep getting errors (compileCode).
Do you have any idea how I can "convert" the following loop to C?
pw.dist <- function(vec1, vec2) {
  return(length(which(vec1 != vec2)))
}
N.row <- dim(table)[1]
pw.dist.table <- array(dim = c(dim(table)[1], dim(table)[1]))
for (i in 1:N.row) {
  for (j in 1:N.row) {
    pw.dist.table[i, j] <- pw.dist(table[i, -c(1)], table[j, -c(1)])
  }
}
I am trying something like:
sig <- signature(N.row = "integer", table = "integer", pw.dist.table = "integer")
code <- "
for (int i = 0; i < (*N.row) - 1; i++) {
  for (int j = i + 1; j < *N.row; j++) {
    int pw.dist.table = table[j] - table[i];
  }
}
"
f <- cfunction(sig, code, convention = ".C")
I am a complete newbie when it comes to programming!
Thanks in advance.
JMFA
Before trying to optimize the code, it is always a good idea to check where the time is spent:
Rprof()
... # Your loops
Rprof(NULL)
summaryRprof()
In your case, the loop is not slow, but your distance function is.
$by.total
total.time total.pct self.time self.pct
"pw.dist" 37.98 98.85 0.54 1.41
"which" 37.44 97.45 34.02 88.55
"!=" 3.12 8.12 3.12 8.12
You can rewrite it as follows (it takes 1 second):
# Sample data
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nr=n, nc=k)
# Function to compute the number of differences
f <- function(i,j) sum(d[i,]!=d[j,])
# You could use a loop, instead of outer,
# it should not make a big difference.
d2 <- outer( 1:n, 1:n, Vectorize(f) )
Vincent above has the right idea. In addition, you can take advantage of how matrices work in R and get even faster results:
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nr=n, nc=k)
system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
#precompute transpose of matrix - you can just replace
#dt with t(d) if you want to avoid this
system.time(dt <- t(d))
system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
Output:
#> system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
# user system elapsed
# 0.4 0.0 0.4
#> system.time(dt <- t(d))
# user system elapsed
# 0 0 0
#> system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
# user system elapsed
# 0.08 0.00 0.08
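Since the question specifically asks how to move the loop into C, it may be worth noting that Rcpp makes that route much less error-prone than the inline cfunction attempt above. A minimal sketch, assuming the d matrix from the sample data (this was not part of the original answers):

library(Rcpp)
cppFunction('
IntegerMatrix pwDist(IntegerMatrix m) {
  int n = m.nrow(), k = m.ncol();
  IntegerMatrix out(n, n);
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j < n; ++j) {
      int diff = 0;
      for (int c = 0; c < k; ++c)
        if (m(i, c) != m(j, c)) ++diff; // count mismatching columns
      out(i, j) = diff;                 // the distance matrix is symmetric,
      out(j, i) = diff;                 // so fill both halves
    }
  }
  return out;
}')
d4 <- pwDist(d)  # same result as the outer/Vectorize approach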

R - Vector/Array Addition

I am having a little trouble with vector/array operations.
I have three 3D arrays and I want to find the element-wise average of them. How can I do that? We can't use mean() as it only returns a single value.
More importantly, some of the cells in the arrays are NA, which means that if I just add them like
A = (B + C + D)/3
the result will show NA as well.
How can I make it recognise when a cell is NA and just skip it?
Like:
A = c(NA, 10, 15, 15, NA)
B = c(10, 15, NA, 22, NA)
C = c(NA, NA, 20, 26, NA)
I want the output of averaging these vectors to be
(10, (10+15)/2, (15+20)/2, (15+22+26)/3, NA)
We also can't use na.omit, because it would shift the indices.
This is the corresponding code; I hope it is helpful.
for (yr in 1950:2011) {
  temp_JFM <- sst5_sst2[,, year5_sst2 == yr & (month5_sst2 >= 1 & month5_sst2 <= 3)]
  k = 0
  jfm = 4 * k + 1
  for (i in 1:72) {
    for (j in 1:36) {
      iposst5_sst2[i, j, jfm] <- (temp_JFM[i, j, 1] + temp_JFM[i, j, 2] + temp_JFM[i, j, 3]) / 3
    }
  }
}
Thank you.
It has already been solved.
The easiest way to correct it is shown below:
iposst5_sst2[i,j,jfm] <- mean(temp_JFM[i,j,],na.rm=TRUE)
I'm not entirely sure what your desired output is, but I'm guessing that what you really want to build is not three 3D arrays, but one 4D array that you can then use apply on.
Something like this:
# Three 3D arrays...
A <- array(runif(1:27), dim = c(3, 3, 3))
B <- array(runif(1:27), dim = c(3, 3, 3))
C <- array(runif(1:27), dim = c(3, 3, 3))
# ...become one 4D array
D <- array(c(A, B, C), dim = c(3, 3, 3, 3))
# Now we can simply apply the function mean
# and use its na.rm = TRUE argument.
apply(D, 1:3, mean, na.rm = TRUE)
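The same idea works for the plain vectors in the question: bind them into a matrix and take row means. A small sketch, assuming the A, B, C vectors from the question (note that, as with apply above, positions where every input is NA come back as NaN rather than NA):

A <- c(NA, 10, 15, 15, NA)
B <- c(10, 15, NA, 22, NA)
C <- c(NA, NA, 20, 26, NA)
# Element-wise mean across the three vectors, skipping NAs.
rowMeans(cbind(A, B, C), na.rm = TRUE)
# [1] 10.0 12.5 17.5 21.0  NaN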
Here's an example which makes a vector of the three values, which makes na.omit usable:
vectorAverage <- function(A, B, C) {
  Z <- rep(NA, length(A))
  for (i in 1:length(A)) {
    x <- na.omit(c(A[i], B[i], C[i]))
    if (length(x) > 0) Z[i] = mean(x)
  }
  Z
}
Resulting in:
vectorAverage(A,B,C)
[1] 10.0 12.5 17.5 21.0 NA
Edited: Missed the NaN in the output of the first version.
