R: Adding columns from one data frame to another, non-matching number of rows

I have a .txt file with millions of rows of data - DateTime (1-min intervals) and Precipitation.
I have a .csv file with thousands of rows of data - DateTime (daily intervals), MaxTemp, MinTemp, WindSpd, WindDir.
I import the .txt file as a data frame and do a few transformations. I then move this into a new data frame.
I import the .csv file as a data frame and do a few transformations. I then want to add the columns from this data frame to the new data frame (seven columns in total). However, R throws an error: "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 10382384, 32868, 1"
I know the number of rows is different, but this is the format I need for the next step in processing. This could easily be done in Excel were it not for the crazy number of rows.
Simulated code is below, which produces the same error:
#10-row data frame (stands in for the .txt data)
a <- as.character(c(1,2,3,4,5,6,7,8,9,10))
b <- c(paste("Date", a))
c <- c(rnorm(10, mean = 5, sd = 2.1))
Frame1 <- data.frame(b,c)
#3-row data frame (stands in for the .csv data)
d <- as.character(c(1,2,3))
e <- c(paste("Date", d))
f <- c(rnorm(3, mean = 1, sd = 0.7))
g <- c(rnorm(3, mean = 3, sd = 2))
h <- c(rnorm(3, mean = 8, sd = 1))
Frame2 <- data.frame(e,f,g,h)
#the second cbind throws the differing-number-of-rows error
NewFrame <- cbind(Frame1)
NewFrame <- cbind(NewFrame, Frame2)
I have tried a *_join, but it throws the error: "Error: `by` must be supplied when `x` and `y` have no common variables. Use `by = character()` to perform a cross-join.", which to me reads like it wants to match things up, which I don't need. I really just need to plop these two datasets side by side for the next processing step. Help?

The data frames MUST have an equal number of rows. To compensate, I added enough rows to the smaller dataset (in my case, it will always be the .csv file) to match the larger dataset, and filled them with NA values. The application I use for downstream processing knows how to handle NA values, so this works well for me.
I've run the solution with a representative dataset and I am able to cbind the two data frames together.
Sample code with the simulated dataset:
#create data frame 1
a <- as.character(c(1:10))
b <- c(paste("Date", a))
c <- c(rnorm(10, mean = 5, sd = 2.1))
Frame1 <- data.frame(b,c)
#create data frame 2
d <- as.character(c(1,2,3))
e <- c(paste("Date", d))
f <- c(rnorm(3, mean = 1, sd = 0.7))
g <- c(rnorm(3, mean = 3, sd = 2))
h <- c(rnorm(3, mean = 8, sd = 1))
Frame2 <- data.frame(e,f,g,h)
#find the maximum number of rows
maxlen <- max(nrow(Frame1), nrow(Frame2))
#extend the smaller dataset to the length of the larger one; assigning to
#row maxlen grows the data frame and fills the intervening rows with NA
Frame2[maxlen, ] <- NA
#creates the new data frame from the two frames
NewFrame <- cbind(Frame1, Frame2)
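A generalized version of the same idea (my own sketch, not part of the original answer): pad whichever frame is shorter, so the argument order no longer matters. The helper name cbind_fill is made up.
#pad the shorter data frame with NA rows, then bind side by side
cbind_fill <- function(x, y) {
  n <- max(nrow(x), nrow(y))
  #assigning to row n grows a data frame and fills the gap rows with NA
  if (nrow(x) < n) x[n, ] <- NA
  if (nrow(y) < n) y[n, ] <- NA
  cbind(x, y)
}
NewFrame <- cbind_fill(Frame1, Frame2)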

Related

Altering arrays to add/remove entries at each time-step in R

This question probably has a simple solution, but I cannot think of how to do it...
So I have a script as follows:
# ------------------ MODEL SETUP ----------------------------------------
# simulation length
t_max <- 50
# arena
arena_x <- 100
arena_y <- 100
# plant parameters
a <- 0.1
b <- 0.1
g <- 1
iterations <- 5
totalBiomass <- matrix(0, nrow = iterations, ncol = 1)
# starting loop
sep <- 10
# Original matrix
plantLocsX <- matrix(rep(seq(0,arena_x,sep), arena_y/sep),
nrow=1+arena_x/sep,
ncol=1+arena_y/sep)
plantLocsY <- t(plantLocsX)
# list of plant locations and initial sizes
nplants <- dim(plantLocsX)[1]*dim(plantLocsX)[2]
plantSizes <- matrix(1, nrow = nplants, ncol = 1)
# Plot the plants
radius <- sqrt( plantSizes/ pi )
symbols(plantLocsX, plantLocsY, radius, xlim = c(0,100), ylim=c(0,100), inches=0.05, fg = "green",
xlab = "x domain (m)", ylab = "y domain (m)", main = "Random Plant Locations", col.main = 51)
# Calculate distances between EACH POSSIBLE PAIR of plants
distances <- matrix(0,nrow=nplants,ncol=nplants)
for (i in 1:nplants){
  for (j in 1:nplants){
    distances[i,j] <- sqrt( (plantLocsX[i]-plantLocsX[j])^2 + (plantLocsY[i]-plantLocsY[j])^2 )
  }
}
# ------------------ MODEL RUNNING ---------------------------------------
I need to alter the arrays containing plant locations and plant sizes so that at each time step entries are removed and added (simulating mortality and reproduction, respectively). The distances must be updated with the new plant locations and sizes after each iteration. I can only think of complex ways to do this, destroying and constructing new matrices at each time step to fit the new number of elements, but there must be functions that make this simpler... any advice?
Many thanks!!
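One possible approach (a sketch, not from the original thread): keep the plant state in a data frame, use logical indexing and rbind() to drop and add rows at each time step, and let dist() rebuild the distance matrix. The survival and reproduction probabilities below are purely illustrative.
plants <- data.frame(x = as.vector(plantLocsX),
                     y = as.vector(plantLocsY),
                     size = 1)
for (step in 1:t_max) {
  #mortality: keep each plant with (illustrative) probability 0.95
  plants <- plants[runif(nrow(plants)) < 0.95, ]
  #reproduction: each survivor spawns a nearby offspring with probability 0.05
  parents <- plants[runif(nrow(plants)) < 0.05, ]
  if (nrow(parents) > 0) {
    offspring <- data.frame(x = parents$x + rnorm(nrow(parents)),
                            y = parents$y + rnorm(nrow(parents)),
                            size = 1)
    plants <- rbind(plants, offspring)
  }
  #recompute all pairwise distances from the current set of rows
  distances <- as.matrix(dist(plants[, c("x", "y")]))
}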

Read multidimensional NetCDF as data frame in R

I use a netCDF file which stores one variable and has the following dimensions: lon, lat, time.
Generally speaking, I wish to compare it against other data that I already have in R stored as a data frame - the first two columns are coordinates in WGS84, and the rest are values for specific times.
So I wrote the following code.
# ncFile$dim$time$units says: [1] "days since 1900-1-1"
daysFromDate <- function(data1, data2="1900-01-01")
{
  round(as.numeric(difftime(data1,data2,units = "days")))
}
#study area:
lon <- c(40.25, 48)
lat <- c(16, 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
varName <- "spei"
require(ncdf4)
require(RCurl)
x <- getBinaryURL("http://digital.csic.es/bitstream/10261/104742/3/SPEI_01.nc")
ncFile <- nc_open(x)
LonIdx <- which( ncFile$dim$lon$vals >= lon[1] | ncFile$dim$lon$vals <= lon[2])
LatIdx <- which( ncFile$dim$lat$vals >= lat[1] & ncFile$dim$lat$vals <= lat[2])
TimeIdx <- which( ncFile$dim$time$vals >= myTime[1] & ncFile$dim$time$vals <= myTime[2])
MyVariable <- ncvar_get( ncFile, varName)[ LonIdx, LatIdx, TimeIdx]
I thought that a data frame would be returned, so that I could easily manipulate the data (for example, check correlations or create plots).
Unfortunately, a 3-dimensional array is returned instead.
How can I reformat this into a data frame with the following columns: X, Y, Time1, Time2, ...?
Example data would look as follows, where 0.5, 0.4 and 0.3 are example variable values:
X  Y  2014-01-01 2014-01-02 2014-01-03
50 17 0.5        0.4        0.3
Or maybe there is a different solution?
OK, try the following code, but note that it assumes the ranges are densely filled. Also, I changed the lon test from an or (|) to an and (&).
require(ncdf4)
nc <- nc_open("SPEI_01.nc")
print(nc)
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
time <- ncvar_get(nc, "time")
lonIdx <- which( lon >= 40.25 & lon <= 48.00)
latIdx <- which( lat >= 16.00 & lat <= 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
timeIdx <- which(time >= myTime[1] & time <= myTime[2])
data <- ncvar_get(nc, "spei")[lonIdx, latIdx, timeIdx]
indices <- expand.grid(lon[lonIdx], lat[latIdx], time[timeIdx])
print(length(indices))
class(indices)
summary(indices)
str(indices)
df <- data.frame(cbind(indices, as.vector(data)))
summary(df)
str(df)
UPDATE
OK, it looks like I understand what you want, but I don't have a direct solution. What I've got so far is this: split the data frame using either the split() function or the data.table package. After splitting by X and Y, you'll get a list of small data frames where X and Y are constant within each frame. It is probably possible to transpose and recombine them back, but I have no idea how; it might be better to continue working with the data as columns. The lists are nested but can be flattened. Here is a link on splitting in R: http://www.uni-kiel.de/psychologie/rexrepos/posts/dfSplitMerge.html
Code, as continued from previous example
require(data.table)
colnames(df) <- c("X","Y","Time","spei")
df$Time <- as.Date(df$Time, origin="1900-01-01")
dt <- as.data.table(df)
summary(dt)
# Taken from https://github.com/Rdatatable/data.table/issues/1389
# x data.table
# f use `by` argument instead - unlike data.frame
# drop logical default FALSE will include `by` columns in resulting data.tables - unlike data.frame
# by character column names on which split into lists
# flatten logical default FALSE will result in recursive nested list having data.table as leafs
# ... ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
  if(missing(by) && !missing(f)) by = f
  stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
  if(!flatten){
    .by = by[1L]
    tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
    setattr(ll <- tmp$.ll, "names", tmp[[.by]])
    if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
  } else {
    tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
    setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
    return(ll)
  }
}
# here is data.table split
q <- split.data.table(dt, by = c("X","Y"), drop=FALSE)
str(q)
# here is data frame split
qq <- split(df, list(df$X, df$Y))
str(qq)
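For the wide X-Y-Time1-Time2-... layout asked for above, a pivot may be simpler than splitting at all. A sketch (not part of the original answer) using dcast() from the already-loaded data.table package:
#one row per X/Y pair, one column per date
wide <- dcast(dt, X + Y ~ Time, value.var = "spei")
head(wide)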

How to create a grid from 1D array using R?

I have a file which contains a 209091-element 1D binary array representing the global land area,
which can be downloaded from here:
ftp://sidads.colorado.edu/DATASETS/nsidc0451_AMSRE_Land_Parms_v01/AMSRE_flags_2002/
I want to create a full grid from the 1D data array using the provided ancillary row and column files globland_r and globland_c, which can be downloaded from here:
ftp://sidads.colorado.edu/DATASETS/nsidc0451_AMSRE_Land_Parms_v01/AMSRE_ancil/
There is code written in Matlab for this purpose, and I want to translate it to R, but I do not know Matlab.
function [gridout, EASE_r, EASE_s] = mkgrid_global(x)
%MKGRID_GLOBAL(x) Creates a matrix for mapping
% gridout = mkgrid_global(x) uses the 209091 element array (x) and returns
%Load ancillary EASE grid row and column data, where <MyDir> is the path to
%wherever the globland_r and globland_c files are located on your machine.
fid = fopen('C:\MyDir\globland_r','r');
EASE_r = fread(fid, 209091, 'int16');
fclose(fid);
fid = fopen('C:\MyDir\globland_c','r');
EASE_s = fread(fid, 209091, 'int16');
fclose(fid);
gridout = NaN.*zeros(586,1383);
%Loop through the element array
for i=1:1:209091
    %Distribute each element to the appropriate location in the output
    %matrix (the row/column indices are 0-based, MATLAB is 1-based; this
    %assignment is reconstructed here to match the R answer below)
    gridout(EASE_r(i)+1, EASE_s(i)+1) = x(i);
end
Edit, following the solution of @mdsumner:
The files MLLATLSB and MLLONLSB (4-byte integers) contain latitude and longitude (multiply by 1e-5) for geo-locating the full global EASE grid matrix (586×1383).
MLLATLSB and MLLONLSB can be downloaded from here:
ftp://sidads.colorado.edu/DATASETS/nsidc0451_AMSRE_Land_Parms_v01/AMSRE_ancil/
## the sparse dims, literally the xcol * yrow indexes
dims <- c(1383, 586)
cfile <- "ftp://sidads.colorado.edu/DATASETS/nsidc0451_AMSRE_Land_Parms_v01/AMSRE_ancil/globland_c"
rfile <- "ftp://sidads.colorado.edu/DATASETS/nsidc0451_AMSRE_Land_Parms_v01/AMSRE_ancil/globland_r"
## be nice, don't abuse this
col <- readBin(cfile, "integer", n = prod(dims), size = 2, signed = FALSE)
row <- readBin(rfile, "integer", n = prod(dims), size = 2, signed = FALSE)
## example data file
fdat <- "ftp://sidads.colorado.edu/DATASETS/nsidc0451_AMSRE_Land_Parms_v01/AMSRE_flags_2002/flags_2002170A.bin"
dat <- readBin(fdat, "integer", n = prod(dims), size = 1, signed = FALSE)
## now get serious
m <- matrix(as.integer(NA), dims[2L], dims[1L])
m[cbind(row + 1L, col + 1L)] <- dat
image(t(m)[,dims[2]:1], col = rainbow(length(unique(m)), alpha = 0.5))
Maybe we can reconstruct this map projection too.
flon <- "MLLONLSB"
flat <- "MLLATLSB"
## the key is that these are integers, floats scaled by 1e5
lon <- readBin(flon, "integer", n = prod(dims), size = 4) * 1e-5
lat <- readBin(flat, "integer", n = prod(dims), size = 4) * 1e-5
## this is all we really need from now on
range(lon)
range(lat)
library(raster)
library(rgdal) ## need for coordinate transformation
ex <- extent(projectExtent(raster(extent(range(lon), range(lat)), crs = "+proj=longlat"), "+proj=cea"))
grd <- raster(ncols = dims[1L], nrows = dims[2L], xmn = xmin(ex), xmx = xmax(ex), ymn = ymin(ex), ymx = ymax(ex), crs = "+proj=cea")
There is probably an "out by half pixel" error in there, left as an exercise.
Test
plot(setValues(grd, m), col = rainbow(max(m, na.rm = TRUE), alpha = 0.5))
Hohum
library(maptools)
data(wrld_simpl)
plot(spTransform(wrld_simpl, CRS(projection(grd))), add = TRUE)
We can now save the valid cellnumbers to match our "grd" template, then read any particular dat-file and just populate the template with those values based on cellnumbers. Also, it seems someone trod nearly this path earlier but not much was gained:
How to identify lat and long for a global matrix?
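A sketch of the cell-number idea from the paragraph above (my own reading of it; depending on the files' row origin the rows may need flipping):
#map the 0-based ancillary row/col indices to raster cell numbers once
cells <- cellFromRowCol(grd, row + 1L, col + 1L)
#then each daily file is a single readBin() plus one assignment
vals <- rep(NA_real_, ncell(grd))
vals[cells] <- dat
grd_day <- setValues(grd, vals)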

Outputting multiple arrays of data in R

I have code that loops through multiple subjects and outputs the run lengths of consecutive 1's in various arrays. The output is something like this:
Variable1RunLengths 2 3 14 12 7 8
Variable2RunLengths 4 9 8 12 4 7 3
And it does this for multiple subjects. I know how to output a single variable to a data frame, but I am having trouble outputting the arrays of data I'm calculating with this code. Any suggestions?
GetRL <- function(df) {
  subjects <- unique(df$Subject)
  numsubjects <- length(subjects)
  runLengths.df <- data.frame()
  for (i in 1:numsubjects) {
    subj <- subjects[i] ##names loop variable
    subdf <- df[which(df$Subject == subj),] ##pulls all data for current subject
    ## pulls vectors within current subject for each task
    patrmdf <- subdf$Patient_Room
    compdf <- subdf$comp
    pertoperdf <- subdf$pertoper
    paperdf <- subdf$paper
    ##calculates runs of ones for each task, pulls lengths of all values = 1
    patrmall <- rle(patrmdf)
    patrmruns <- patrmall$lengths[patrmall$values == 1]
    patrmslength <- length(patrmruns)
    compall <- rle(compdf)
    compruns <- compall$lengths[compall$values == 1]
    complength <- length(compruns)
    pertoperall <- rle(pertoperdf)
    pertoperruns <- pertoperall$lengths[pertoperall$values == 1]
    pertoperlength <- length(pertoperruns)
    paperall <- rle(paperdf)
    paperruns <- paperall$lengths[paperall$values == 1]
    paperlength <- length(paperruns)
    ##outputs vectors and variables (note: each assignment below overwrites
    ##the previous one, which is the problem described above)
    runLengths.df <- subj
    runLengths.df <- patrmruns
    runLengths.df <- compruns
    runLengths.df <- pertoperruns
    runLengths.df <- paperruns
  }
  return(runLengths.df)
}
A data frame is a poor choice of data structure for this, because you have arrays that can be different sizes. I would try a list of lists. Outside the loop, you would initialize
runLengths<-list()
Then at the bottom of the loop, you would do
## note [[subj]]: $subj would create a single element literally named "subj"
runLengths[[subj]] <- list(patrm = patrmruns,
                           comp = compruns,
                           pertoper = pertoperruns,
                           paper = paperruns)
Then, for example, to recover the comp run lengths for subject XYZ you would write
runLengths$XYZ$comp
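If a flat table is eventually needed anyway, the nested list flattens into a long data frame in which variable-length runs are no problem. A sketch (the column names are my own; it assumes every task has at least one run):
long <- do.call(rbind, lapply(names(runLengths), function(s) {
  do.call(rbind, lapply(names(runLengths[[s]]), function(task) {
    data.frame(Subject = s, Task = task, RunLength = runLengths[[s]][[task]])
  }))
}))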

How to read multiple files into a multi-dimensional array

I want to make a 3-dimensional array.
Here is what I tried:
z <- c(160,720,420)
first_data_set <- array(dim = length(file_1), dimnames = z)
The data that I am reading is on one level (only x and y).
There are other data in the same format, and I need to put them in the same array as the first data set, so that once I finish reading all the data, everything is in the same array with no overwriting.
So I think the array has to be 3-dimensional; otherwise I cannot keep all the data that I read in the loop.
Say that you have two matrices of size 3x4:
m1 <- matrix(rnorm(12), nrow = 3, ncol = 4)
m2 <- matrix(rnorm(12), nrow = 3, ncol = 4)
If you want to place them in an array, first make an array of NA's:
A <- array(as.numeric(NA), dim = c(3,4,2))
Then populate the layers with data:
A[,,1] <- m1
A[,,2] <- m2
As suggested by @Justin, you could also just put the matrices together in a list:
A2 <- list()
A2[['m1']] <- m1
A2[['m2']] <- m2
To read matrices from files: using a list makes it easier to get these matrices from files in a directory, without having to specify the dimensions in advance. Assume you want all files with extension csv:
myfiles <- dir(pattern = ".csv")
for (i in 1:length(myfiles)){
  A2[[myfiles[i]]] <- read.table(myfiles[i], sep = ',')
}
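If every file really has the same dimensions, the list from the loop above can still be collapsed into a 3-dimensional array afterwards. A sketch, assuming equal-sized inputs:
#simplify2array() stacks equal-sized matrices along a third dimension
A3 <- simplify2array(lapply(A2, as.matrix))
dim(A3)  #rows x cols x number of files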
