R e1071 trains faster than LIBSVM in C

I am trying to use nu- and epsilon-SVR from the LIBSVM library in C, after trying it in R (package e1071). But the training time differs too much: in R it took less than a second to generate my model, and in C it took more than 2 minutes. Is this normal behavior?
In C I am running it like this:
./svm-train -s 3 -t 0 -q series.train series.model
I have tried giving more parameters, using more cache memory, and adding the -fopenmp flag to the compilation (it added about 10 extra seconds to the running time).
Any hint would be appreciated!
Edit:
My training file:
5.7367 1:1
5.46076 1:2
4.80722 1:3
4.80722 1:4
4.64745 1:5
4.66924 1:6
4.52401 1:7
4.76364 1:8
4.06652 1:9
4.03748 1:10
...
...
...
64.02734 1:1999
63.6241 1:2000
It is supposed to be a financial time series: the first column is the closing price of the stock, and the index is just a numerical value in increasing order (instead of the date).
R Code:
library(e1071)
data <- read.csv("/home/manzha/series/GCARSO.csv", header = TRUE)
X <- c(1:2310)  # 2310 is the total number of rows
trainL <- 2000
X_train <- c(1:trainL)
X_test <- c((trainL + 1):length(X))
Y_test <- data$Adj.Close[(trainL + 1):length(X)]
Y_train <- data$Adj.Close[1:trainL]
DF <- data.frame(x = X_train, y = Y_train)
model <- svm(y ~ x, data = DF, kernel = "linear", cost = 2, epsilon = 0.5, type = "eps-regression")
predictedY <- predict(model, newdata = data.frame(x = X_test))
I have read some of the e1071 documentation: it scales the data automatically. I have tried scaling my data myself, and training then takes no time at all! But the results differ from R's.
If I write:
model <- svm(y ~ x, data = DF, kernel = "linear", cost = 2, epsilon = 0.5, type = "eps-regression", scale = FALSE)
R takes almost as long to train as the C libsvm, and it gives the same result.
So for now, I think my problem is how my training and test files are written after scaling.
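For completeness, here is a minimal sketch (illustrative only, not my original code) of how e1071's default scaling could be reproduced before writing the libsvm-format training file. By default e1071::svm() centers and scales both the feature and the target to zero mean and unit variance, so the C model only sees scaled values and its predictions must be transformed back afterwards.
# Hypothetical reproduction of e1071's default scaling before exporting
# the data in libsvm format; 'data' and 'trainL' come from the R code above.
series <- data$Adj.Close[1:trainL]   # close prices used for training
idx    <- 1:trainL                   # time index used as the only feature
y_s <- scale(series)                 # center/scale the target
x_s <- scale(idx)                    # center/scale the feature
writeLines(paste0(as.numeric(y_s), " 1:", as.numeric(x_s)),
           "series.scaled.train")
# Predictions from ./svm-train on this file live on the scaled scale; rescale
# them back with attr(y_s, "scaled:center") and attr(y_s, "scaled:scale").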

Related

shapes not aligned error when performing Singular Value Decomposition using scipy.sparse.linalg

I am trying to use Singular Value Decomposition (SVD) to predict missing values in a sparse matrix. Chapter 4 of the "Building Recommendation Engines in Python" Datacamp course provides an example of doing this with movie ratings, which is great. As a first step, I have been trying to replicate this Datacamp example on my local PC using Jupyter Notebook. However, when I try to multiply U_sigma (the product of U and sigma) with the Vt matrix returned by the "svds" function, I get an error:
ValueError: shapes (671,) and (6,9161) not aligned: 671 (dim 0) != 6 (dim 0)
I am using this dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset/version/7?select=ratings_small.csv
Here is the code I am trying to run:
import pandas as pd
filename = 'ratings_small.csv'
df = pd.read_csv(filename)
df.head()
user_ratings_df = df.pivot(index='userId', columns='movieId', values='rating')
# Get the average rating for each user
avg_ratings = user_ratings_df.mean(axis=1)
# Center each user's ratings around 0 (subtract each user's mean, row-wise)
user_ratings_centered = user_ratings_df.sub(avg_ratings, axis=0)
# Fill in all missing values with 0s
user_ratings_centered.fillna(0, inplace=True)
# Print the mean of each row (should be approximately 0 after centering)
print(user_ratings_centered.mean(axis=1))
######################
# Import the required libraries
from scipy.sparse.linalg import svds
import numpy as np
# Decompose the matrix
U, sigma, Vt = svds(user_ratings_centered)
## Now that you have your three factor matrices, you can multiply them back together to get complete ratings data
# without missing values. In this exercise, you will use numpy's dot product function to multiply U and sigma first,
# then the result by Vt. You will then be able to add the average ratings for each row to find your final ratings.
# Dot product of U and sigma
U_sigma = np.dot(U, sigma)
# Dot product of result and Vt
U_sigma_Vt = np.dot(U_sigma, Vt)
There was a missing line of code. svds returns sigma as a 1-D array of singular values rather than a diagonal matrix, which is why the dot-product shapes do not align. After running "svds" to decompose the matrix, we need this line:
# Convert sigma into a diagonal matrix
sigma = np.diag(sigma)

R - apply function on each element of array in parallel

I have measurements of maximum and minimum temperature and precipitation organized as arrays of size
(100 x 96 x 50769), where i and j are grid cells with associated coordinates and z is the number of measurements over time.
Conceptually, each [i, j] grid cell holds a time series of length z.
I am using the climdex.pcic package to calculate indices of extreme weather events. Given a time series of maximum and minimum temperature and precipitation, the climdexInput.raw function returns a climdexInput object that can be used to compute several indices: number of frost days, number of summer days, consecutive dry days, etc.
The call for the function is pretty simple:
ci <- climdexInput.raw(tmax=x, tmin=y, prec=z,
                       t, t, t, base.range=c(1961,1990))
where x is a vector of maximum temperatures, y is a vector of minimum temperatures, z is a vector of precipitation and t is a vector with dates under which x, y and z were measured.
What I would like to do is to extract the time series for each element of my array (i.e. each [i, j] grid cell) and use it to run the climdexInput.raw function.
Because of the large number of elements of real data, I want to run this task in parallel on my 4-core Linux server. However, I have no experience with parallelization in R.
Here's one example of my program (with intentionally reduced dimensions to make execution faster on your computer):
library(climdex.pcic)
# Create some dates
t <- seq(as.Date('2000-01-01'), as.Date('2010-12-31'), 'day')
# Parse the dates into PCICt
t <- as.PCICt(strftime(t), cal='gregorian')
# Create some dummy weather data, with dimensions `# of lat`, `# of lon` and `# of timesteps`
nc.min <- array(runif(10*9*4018, min=0, max=15), c(10, 9, 4018))
nc.max <- array(runif(10*9*4018, min=25, max=40), c(10, 9, 4018))
nc.prc <- array(runif(10*9*4018, min=0, max=25), c(10, 9, 4018))
# Create "ci" object
ci <- climdexInput.raw(tmax=nc.max[1,1,], tmin=nc.min[1,1,], prec=nc.prc[1,1,],
                       t, t, t, base.range=c(2000,2005))
# Once you have "ci", you can compute any of the indices provided by the climdex.pcic package.
# The example below is for the consecutive dry days (CDD) index per year:
cdd <- climdex.cdd(ci, spells.can.span.years = TRUE)
Now, please note that in the example above I used only the first element of my array ([1,1,]) in the climdexInput.raw function.
How can I do the same for all elements, taking advantage of parallel processing, possibly by looping over the dimensions i and j of my array?
You can use foreach to do that:
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
res <- foreach(j = seq_len(ncol(nc.min))) %:%
  foreach(i = seq_len(nrow(nc.min))) %dopar% {
    climdex.pcic::climdexInput.raw(
      tmax = nc.max[i, j, ],
      tmin = nc.min[i, j, ],
      prec = nc.prc[i, j, ],
      t, t, t,
      base.range = c(2000, 2005)
    )
  }
stopCluster(cl)
See my guide on parallelism using foreach: https://privefl.github.io/blog/a-guide-to-parallelism-in-r/.
Then, to compute an index, just use climdex.cdd(res[[1]][[1]], spells.can.span.years = TRUE) (j first, i second).
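As a small follow-up sketch (my addition, not part of the original answer), the nested list res can be collapsed back into a matrix matching the grid, here using the consecutive-dry-days index of the first year as an illustration:
cdd_first_year <- sapply(res, function(col)   # outer list: one element per column j
  sapply(col, function(ci_cell)               # inner list: one element per row i
    climdex.pcic::climdex.cdd(ci_cell, spells.can.span.years = TRUE)[1]))
dim(cdd_first_year)   # nrow(nc.min) x ncol(nc.min): one value per grid cell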

Best way to pick random elements from an array with at least a min diff in R

I would like to randomly choose a certain number of elements from an array so that the chosen elements always respect a minimum pairwise distance.
For example, having a vector a <- seq(1,1000), how can I pick 20 elements with a minimum distance of 15 between each other?
For now, I am using a simple iteration that rejects a choice whenever it is too close to an already-chosen element, but it is cumbersome and tends to be slow when the number of elements to pick is high. Is there a best practice or a function for this?
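For reference, a minimal sketch of the kind of rejection loop I mean (illustrative only, not my actual code):
# Hit-and-miss sketch: draw one candidate at a time and reject it if it falls
# within min.dist of any element already picked. It can loop for a long time
# when n.picks is large relative to the available space, which is exactly the
# slowness described above.
reject_pick <- function(ar, min.dist, n.picks) {
  picked <- numeric(0)
  while (length(picked) < n.picks) {
    cand <- sample(ar, 1)
    if (length(picked) == 0 || all(abs(cand - picked) >= min.dist))
      picked <- c(picked, cand)
  }
  picked
}
# e.g. reject_pick(seq(1, 1000), 15, 20)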
EDIT - Summary of answers and analysis
So far I have received two working answers, which I wrapped in two functions.
# dash2 approach
# --------------
rand_pick_min <- function(ar, min.dist, n.picks){
  stopifnot(is.numeric(min.dist),
            is.numeric(n.picks), n.picks %% 1 == 0)
  if(length(ar)/n.picks < min.dist)
    stop('The number of picks exceeds the maximum number of divisions that the array allows which is: ',
         floor(length(ar)/min.dist))
  picked <- array(NA, n.picks)
  copy <- ar
  for (i in 1:n.picks) {
    stopifnot(length(copy) > 0)
    picked[i] <- sample(copy, 1)
    copy <- copy[abs(copy - picked[i]) >= min.dist]
  }
  return(picked)
}
# denis approach
# --------------
rand_pick_min2 <- function(ar, min.dist, n.picks){
  require(Surrogate)
  stopifnot(is.numeric(min.dist),
            is.numeric(n.picks), n.picks %% 1 == 0)
  if(length(ar)/n.picks < min.dist)
    stop('The number of picks exceeds the maximum number of divisions that the array allows which is: ',
         floor(length(ar)/min.dist))
  lar <- length(ar)
  dist <- Surrogate::RandVec(a=min.dist, b=(lar-n.picks*min.dist), s=lar,
                             n=(n.picks+1), m=1, Seed=sample(1:lar, size = 1))$RandVecOutput
  return(cumsum(round(dist))[1:n.picks])
}
Using the same example as proposed, I ran 3 tests. First, I checked that the minimum distance is effectively respected:
# Libs
require(ggplot2)
require(microbenchmark)
# Inputs
a <- seq(1, 1000) # test vector
md <- 15 # min distance
np <- 20 # number of picks
# Run
dist_vec <- c(sapply(1:500, function(x) c(dist(rand_pick_min(a, md, np))))) # sol 1
dist_vec2 <- c(sapply(1:500, function(x) c(dist(rand_pick_min2(a, md, np))))) # sol 2
# Tests - break the min
cat('Any distance breaking the min in sol 1?', any(dist_vec < md), '\n') # FALSE
cat('Any distance breaking the min in sol 2?', any(dist_vec2 < md), '\n') # FALSE
Second, I tested the distribution of the resulting distances, obtaining the first two plots in order of solution (sol 1 [A] is dash2's solution, while sol 2 [B] is denis' one).
pa <- ggplot() + theme_classic() +
  geom_density(aes_string(x = dist_vec), fill = 'lightgreen') +
  geom_vline(aes_string(xintercept = mean(dist_vec)), col = 'darkred') + xlab('Distances')
pb <- ggplot() + theme_classic() +
  geom_density(aes_string(x = dist_vec2), fill = 'lightgreen') +
  geom_vline(aes_string(xintercept = mean(dist_vec2)), col = 'darkred') + xlab('Distances')  # mean of the second set of distances
print(pa)
print(pb)
Lastly, I measured the computation time of the two approaches as follows, obtaining the last figure.
comp_times <- microbenchmark::microbenchmark(
  'solution_1' = rand_pick_min(a, md, np),
  'solution_2' = rand_pick_min2(a, md, np),
  times = 500
)
ggplot2::autoplot(comp_times); ggsave('stckoverflow2.png')
In light of these results, I am asking myself whether this distance distribution is to be expected or whether it is a deviation introduced by the methods used.
EDIT2 - Answer to the last question, following the comment made by denis
Using many more sampling repetitions (5000), I produced an empirical density of the resulting positions, and indeed your approach contains an artefact that makes your solution (B) deviate from the one I needed. Nonetheless, it would be interesting to be able to enforce a specific final distribution of positions.
If you want to avoid the hit-and-miss method, you will have to translate your problem into a sampling of distances with a constraint on the sum of the distances.
Basically, this is how I translate what you want: your N sampled positions are equivalent to N+1 distances, each ranging from the minimum distance to the size of your vector minus N*mindist (the case where all your samples are packed together). You then need to constrain the sum of the distances to be equal to 1000 (the size of your vector).
In this case the solution will use Surrogate::RandVec from the Surrogate package (see Random sampling to give an exact sum), which allows sampling with a fixed sum.
library(Surrogate)
a <- seq(1, 1000)
mind <- 15
N <- 20
dist <- Surrogate::RandVec(a=mind, b=(1000-N*mind), s=1000,
                           n=(N+1), m=1, Seed=sample(1:1000, size = 1))$RandVecOutput
pos <- cumsum(round(dist))[1:20]
pos
> pos
[1] 22 59 76 128 204 239 289 340 389 440 489 546 567 607 724 773 808 843 883 927
dist is the sample of distances. You reconstruct your positions by taking the cumulative sum of the distances; this gives you pos, the vector of your index positions.
The advantage is that any value can be obtained and the sampling is supposed to be random. As for speed, I don't know; you will need to compare it against your method on your big-data case.
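A quick sanity check (not in the original answer) that the reconstructed positions respect the minimum distance:
all(diff(pos) >= mind)   # TRUE: consecutive positions differ by at least mind
length(pos)              # 20 positions, as requested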
Here is a histogram of 1000 tries:
I think the best solution, which guarantees randomness in some sense (I'm not exactly sure what sense!) may be:
1. Pick a random element.
2. Remove all elements that are too close to that element.
3. Pick another element.
4. Return to step 2.
So:
min_dist <- 15
a <- seq(1, 1000)
picked <- integer(20)
copy <- a
for (i in 1:20) {
  stopifnot(length(copy) > 0)
  picked[i] <- sample(copy, 1)
  copy <- copy[abs(copy - picked[i]) >= min_dist]
}
Whether this is faster than sample-and-reject may depend on the characteristics of the original vector. Also, as you can see, you are not guaranteed to be able to get all the elements you want, though in your particular case there won't be a problem because 19 intervals of width 30 could never cover the whole of seq(1, 1000).
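To make that caveat concrete (an added illustration, not part of the original answer): with a vector of length 100 and a minimum distance of 15, at most 7 picks are possible, so asking for 20 trips the stopifnot() guard before the loop finishes.
min_dist <- 15
a_small <- seq(1, 100)
picked <- integer(20)
copy <- a_small
for (i in 1:20) {
  stopifnot(length(copy) > 0)   # errors once the candidate pool is exhausted
  picked[i] <- sample(copy, 1)
  copy <- copy[abs(copy - picked[i]) >= min_dist]
}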

Plot 3d surface map from data frame

I first begin by running the code below to tune a SVM:
tunecontrol <- tune.control(nrepeat = 5, sampling = "fix", cross = 5, performances = TRUE)
tune_svm1 <- tune(svm,
                  Y ~ 1 + X,
                  data = data,
                  ranges = list(epsilon = seq(epsilon_start, epsilon_end,
                                              (epsilon_end - epsilon_start) / 10),
                                cost = cost_start * (1:5),
                                gamma = seq(gamma_start, gamma_end,
                                            (gamma_end - gamma_start) / 5)),
                  tunecontrol = tunecontrol)
In tune_svm1$performances I have 330 observations containing all the values for epsilon, cost, and gamma that I stated in the ranges section of the above code as well as another column for the calculated error.
I'd like to generate a 3D surface plot for epsilon, cost, gamma, and error, using three of the variables as X, Y, Z and the fourth for color. I've read several resources on plot3d and persp but have had a lot of difficulty implementing them.
If I try to follow the examples provided and use mesh to generate a mesh plot, I can only mesh together 3 of the 4 variables from tune_svm1$performances, and saving the separate results for X, Y and Z as shown in the first link is difficult because the mesh is saved as an array, not a matrix. I've tried to hack together a graph using the following code, but the visual is nonsensical (probably because the order isn't preserved when meshing each pair individually):
M1 <- mesh(tune_svm1$performances$epsilon[1:nrow(tune_svm1$performances)],
           tune_svm1$performances$cost[1:nrow(tune_svm1$performances)])
M2 <- mesh(tune_svm1$performances$epsilon[1:nrow(tune_svm1$performances)],
           tune_svm1$performances$gamma[1:nrow(tune_svm1$performances)])
M3 <- mesh(tune_svm1$performances$epsilon[1:nrow(tune_svm1$performances)],
           tune_svm1$performances$error[1:nrow(tune_svm1$performances)])
x <- M1$x ; y <- M1$y ; z <- M2$y ; c <- M3$y
surf3D(x, y, c, colvar = c)
What's the best way to approach this? Thank you.
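Not a definitive answer, but one hedged option: because tune_svm1$performances is a full factorial grid over three parameters rather than a surface z = f(x, y), a 3-D scatter with the fourth variable mapped to colour may be easier to read than mesh()/surf3D(). This sketch uses the same plot3D package that provides mesh() and surf3D(); the column names are the ones assumed above.
library(plot3D)
perf <- tune_svm1$performances   # assumed columns: epsilon, cost, gamma, error
scatter3D(x = perf$epsilon, y = perf$cost, z = perf$gamma,
          colvar = perf$error, pch = 19,
          xlab = "epsilon", ylab = "cost", zlab = "gamma", clab = "error")
If a true surface is wanted, one way is to fix gamma at a single value and reshape the error into an epsilon-by-cost matrix for persp3D(), at the cost of showing only one slice of the grid.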

How do I create a progress bar for data loading in R?

Is it possible to create a progress bar for data loaded into R using load()?
For a data analysis project, large matrices are being loaded into R from .RData files, which take several minutes to load. I would like to have a progress bar to monitor how much longer it will be before the data is loaded. R already has nice progress bar functionality integrated, but load() has no hooks for monitoring how much data has been read. If I can't use load() directly, is there an indirect way I can create such a progress bar? Perhaps loading the .RData file in chunks and putting them together for R. Does anyone have any thoughts or suggestions on this?
I came up with the following solution, which will work for file sizes less than 2^32 - 1 bytes.
The R object needs to be serialized and saved to a file, as done by the following code.
saveObj <- function(object, file.name){
  outfile <- file(file.name, "wb")
  serialize(object, outfile)
  close(outfile)
}
Then we read the binary data in chunks, keeping track of how much is read and updating the progress bar accordingly.
loadObj <- function(file.name){
  library(foreach)
  library(iterators)  # provides icount()
  filesize <- file.info(file.name)$size
  chunksize <- ceiling(filesize / 100)
  pb <- txtProgressBar(min = 0, max = 100, style = 3)
  infile <- file(file.name, "rb")
  data <- foreach(it = icount(100), .combine = c) %do% {
    setTxtProgressBar(pb, it)
    readBin(infile, "raw", chunksize)
  }
  close(infile)
  close(pb)
  return(unserialize(data))
}
The code can be run as follows:
> a <- 1:100000000
> saveObj(a, "temp.RData")
> b <- loadObj("temp.RData")
|======================================================================| 100%
> all.equal(b, a)
[1] TRUE
If we benchmark the progress bar method against reading the file in a single chunk we see the progress bar method is slightly slower, but not enough to worry about.
> system.time(unserialize(readBin(infile, "raw", file.info("temp.RData")$size)))
user system elapsed
2.710 0.340 3.062
> system.time(b <- loadObj("temp.RData"))
|======================================================================| 100%
user system elapsed
3.750 0.400 4.154
So while the above method works, I feel it is completely useless because of the file size restrictions. Progress bars are only useful for large files that take a long time to read in.
It would be great if someone could come up with something better than this solution!
Might I instead suggest speeding up the load (and save) times so that a progress bar isn't needed? If reading one matrix is "fast", you could then potentially report progress between each read matrix (if you have many).
Here are some measurements. By simply setting compress=FALSE, the load speed is doubled. By writing a simple matrix serializer, loading becomes almost 20x faster.
x <- matrix(runif(1e7), 1e5) # Matrix with 100k rows and 100 columns
system.time( save('x', file='c:/foo.bin') ) # 13.26 seconds
system.time( load(file='c:/foo.bin') ) # 2.03 seconds
system.time( save('x', file='c:/foo.bin', compress=FALSE) ) # 0.86 seconds
system.time( load(file='c:/foo.bin') ) # 0.92 seconds
system.time( saveMatrix(x, 'c:/foo.bin') ) # 0.70 seconds
system.time( y <- loadMatrix('c:/foo.bin') ) # 0.11 seconds !!!
identical(x,y)
Where saveMatrix/loadMatrix are defined as follows. They don't currently handle dimnames and other attributes, but that could easily be added.
saveMatrix <- function(m, fileName) {
  con <- file(fileName, 'wb')
  on.exit(close(con))
  writeBin(dim(m), con)
  writeBin(typeof(m), con)
  writeBin(c(m), con)
}
loadMatrix <- function(fileName) {
  con <- file(fileName, 'rb')
  on.exit(close(con))
  d <- readBin(con, 'integer', 2)
  type <- readBin(con, 'character', 1)
  structure(readBin(con, type, prod(d)), dim = d)
}
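Building on the suggestion above, here is a hedged sketch (not part of the original answer) of how progress could be reported between matrices when the data is split across several files saved with saveMatrix(); 'files' is a hypothetical vector of file paths.
loadMatrices <- function(files) {
  # one progress-bar step per matrix file, using the loadMatrix() helper above
  pb <- txtProgressBar(min = 0, max = length(files), style = 3)
  result <- vector("list", length(files))
  for (i in seq_along(files)) {
    result[[i]] <- loadMatrix(files[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)
  result
}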
