shapes not aligned error when performing Singular Value Decomposition using scipy.sparse.linalg

I am trying to use Singular Value Decomposition (SVD) to predict missing values in a sparse matrix. Chapter 4 of the "Building Recommendation Engines in Python" Datacamp course provides an example of doing this with movie ratings. As a first step, I have been trying to replicate this Datacamp example on my local PC using Jupyter Notebook. However, when I try to multiply U_sigma by the Vt matrix returned by the "svds" function, I get an error:
ValueError: shapes (671,) and (6,9161) not aligned: 671 (dim 0) != 6 (dim 0)
I am using this dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset/version/7?select=ratings_small.csv
Here is the code I am trying to run:
import pandas as pd
filename = 'ratings_small.csv'
df = pd.read_csv(filename)
df.head()
user_ratings_df = df.pivot(index='userId', columns='movieId', values='rating')
# Get the average rating for each user
avg_ratings = user_ratings_df.mean(axis=1)
# Center each user's ratings around 0
user_ratings_centered = user_ratings_df.sub(avg_ratings, axis=1)
# Fill in all missing values with 0s
user_ratings_centered.fillna(0, inplace=True)
# Print the mean of each column
print(user_ratings_centered.mean(axis=1))
######################
# Import the required libraries
from scipy.sparse.linalg import svds
import numpy as np
# Decompose the matrix
U, sigma, Vt = svds(user_ratings_centered)
## Now that you have your three factor matrices, you can multiply them back together to get complete ratings data
# without missing values. In this exercise, you will use numpy's dot product function to multiply U and sigma first,
# then the result by Vt. You will then be able to add the average ratings for each row to find your final ratings.
# Dot product of U and sigma
U_sigma = np.dot(U, sigma)
# Dot product of result and Vt
U_sigma_Vt = np.dot(U_sigma, Vt)

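The code is missing one line. svds returns sigma as a 1-D array of singular values, so np.dot(U, sigma) produces a vector of shape (671,) rather than a matrix, and the second product fails. After running "svds" to decompose the matrix, convert sigma to a diagonal matrix first: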
# Convert sigma into a diagonal matrix
sigma = np.diag(sigma)
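To see why the extra line matters, here is a minimal sketch on a small random matrix. The 20x30 size, the seed, and k=6 are arbitrary choices for illustration, not values taken from the ratings dataset.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Illustrative stand-in for the centered ratings matrix (assumed sizes).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30))

U, sigma, Vt = svds(A, k=6)
print(U.shape, sigma.shape, Vt.shape)   # (20, 6) (6,) (6, 30)

# sigma is 1-D, so this is a matrix-vector product with shape (20,);
# multiplying that by Vt is what raises the "shapes not aligned" error.
bad = np.dot(U, sigma)

# Converting sigma to a (6, 6) diagonal matrix keeps every factor 2-D.
U_sigma = np.dot(U, np.diag(sigma))
approx = np.dot(U_sigma, Vt)            # rank-6 approximation, shape (20, 30)
```

The same result can be obtained without building the diagonal matrix by scaling the columns of U, i.e. (U * sigma) @ Vt.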

R - apply function on each element of array in parallel

I have measurements of maximum and minimum temperature and precipitation that are organized as arrays of size
(100x96x50769), where i and j index grid cells with associated coordinates and the third dimension, z, indexes the measurements over time.
Conceptually, it looks like this:
I am using the climdex.pcic package to calculate indices of extreme weather events. Given a time series of maximum and minimum temperature and precipitation, the climdexInput.raw function will return a climdexInput object that can be used to determine several indices: number of frost days, number of summer days, consecutive dry days, etc.
The call for the function is pretty simple:
ci <- climdexInput.raw(tmax=x, tmin=y, prec=z,
t, t, t, base.range=c(1961,1990))
where x is a vector of maximum temperatures, y is a vector of minimum temperatures, z is a vector of precipitation and t is a vector with dates under which x, y and z were measured.
What I would like to do is to extract the timeseries for each element of my array (i.e. each grid cell in the figure above) and use it to run the climdexInput.raw function.
Because of the large number of elements of real data, I want to run this task in parallel on my 4-core Linux server. However, I have no experience with parallelization in R.
Here's one example of my program (with intentionally reduced dimensions to make execution faster on your computer):
library(climdex.pcic)
# Create some dates
t <- seq(as.Date('2000-01-01'), as.Date('2010-12-31'), 'day')
# Parse the dates into PCICt
t <- as.PCICt(strftime(t), cal='gregorian')
# Create some dummy weather data, with dimensions `# of lat`, `# of lon` and `# of timesteps`
nc.min <- array(runif(10*9*4018, min=0, max=15), c(10, 9, 4018))
nc.max <- array(runif(10*9*4018, min=25, max=40), c(10, 9, 4018))
nc.prc <- array(runif(10*9*4018, min=0, max=25), c(10, 9, 4018))
# Create "ci" object
ci <- climdexInput.raw(tmax=nc.max[1,1,], tmin=nc.min[1,1,], prec=nc.prc[1,1,],
t, t, t, base.range=c(2000,2005))
# Once you have “ci”, you can compute any of the indices provided by the climdex.pcic package.
# The example below is for cumulative # of dry days per year:
cdd <- climdex.cdd(ci, spells.can.span.years = TRUE)
Now, please note that in the example above I used only the first element of my array ([1,1,]) as an example in the climdexInput.raw function.
How can I do the same for all elements, taking advantage of parallel processing, possibly by looping over the dimensions i and j of my array?
You can use foreach to do that:
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
res <- foreach(j = seq_len(ncol(nc.min))) %:%
  foreach(i = seq_len(nrow(nc.min))) %dopar% {
    ci <- climdex.pcic::climdexInput.raw(
      tmax=nc.max[i,j,],
      tmin=nc.min[i,j,],
      prec=nc.prc[i,j,],
      t, t, t,
      base.range=c(2000,2005)
    )
  }
stopCluster(cl)
See my guide on parallelism using foreach: https://privefl.github.io/blog/a-guide-to-parallelism-in-r/.
Then, to compute an index, just use climdex.cdd(res[[1]][[1]], spells.can.span.years = TRUE) (j first, i second).

Replacing specific column of an array with another array: Error Reason

I am trying to solve a set of linear equations which are solved recursively. At each time step, my solution is gamma having the shape of (3,1). This system is iteratively solved 20 times to get to the final value of gamma.
I am trying to store the value of gamma at each step in another array so that I can access the values of gamma at every step after the code run is complete. When I try storing the gamma value after each step into gamma_solution, it gives the following error:
SyntaxError: can't assign to function call
Where am I going wrong? Is there a better way to do this?
Thanks
Input code:
gamma_solution = np.zeros((3, n_steps))
for i in range(n_steps):
    # <code to solve a system of equations to give gamma as result>
    gamma_solution[:,i].reshape((3,1)) = gamma
Output:
Error
Expectation: At each step i, store the value of gamma obtained in that step in the i-th column of gamma_solution
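The SyntaxError occurs because the left-hand side of the assignment ends in a function call (.reshape(...)), and Python cannot assign to the result of a call. As for the shapes: gamma_solution has shape (3, N) and gamma_solution[:, j] has shape (3,), so you need to transpose gamma (which has shape (3, 1)) to store it in the j-th column of gamma_solution. See the code below: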
import numpy as np
N = 10
gamma_solution = np.zeros((3, N))
gamma = np.arange(3)[:, np.newaxis]
for j in range(N):  # main loop where gamma values are computed
    gamma_solution[:, j] = gamma.T
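As a hedged alternative to transposing, the column vector can be flattened before assignment, or assigned into a slice that keeps the (3, 1) shape. The sketch below uses a placeholder gamma (the real one would come from the solver) and an illustrative step count.

```python
import numpy as np

n_steps = 5                                  # illustrative step count
gamma_solution = np.zeros((3, n_steps))

for j in range(n_steps):
    # Placeholder for the solver output: a (3, 1) column vector.
    gamma = (j + 1) * np.arange(3.0).reshape(3, 1)
    # ravel() gives shape (3,), matching gamma_solution[:, j].
    gamma_solution[:, j] = gamma.ravel()

# Equivalent: index with j:j+1 so the target slice has shape (3, 1).
check = np.zeros((3, n_steps))
for j in range(n_steps):
    gamma = (j + 1) * np.arange(3.0).reshape(3, 1)
    check[:, j:j + 1] = gamma
```

Both loops fill the arrays identically; which form reads better is mostly a matter of taste.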

R e1071 trains faster than Libsvm

I am trying to use nu- and epsilon-SVR from the LIBSVM library in C after trying them in R (package e1071). But the training times differ greatly: in R it took less than a second to generate my model, while in C it took more than 2 minutes. Is this normal behavior?
In C I am running it like this:
./svm-train -s 3 -t 0 -q series.train series.model
I have tried giving more parameters, using more cache memory, and adding the -fopenmp flag at compilation (it added about 10 extra seconds to the running time).
Any hint would be appreciated!
Edit:
My training file:
5.7367 1:1
5.46076 1:2
4.80722 1:3
4.80722 1:4
4.64745 1:5
4.66924 1:6
4.52401 1:7
4.76364 1:8
4.06652 1:9
4.03748 1:10
...
...
...
64.02734 1:1999
63.6241 1:2000
It is supposed to be a financial time series, the first column is the close price of the stock, and the first index is just a numerical value in increasing order (instead of the date).
R Code:
data<-read.csv("/home/manzha/series/GCARSO.csv",header=TRUE)
X <- c(1:2310) # 2310 is the total of rows
trainL<-2000
X_train <- c(1:trainL)
X_test <- c((trainL+1):length(X))
Y_test<-data$Adj.Close[(trainL+1):length(X)]
Y_train <- data$Adj.Close[1:trainL]
DF <- data.frame(x = X_train, y = Y_train)
model <- svm(y ~ x, data = DF, kernel= "linear", cost=2, epsilon=0.5, type="eps-regression")
predictedY <- predict(model, newdata= data.frame(x= X_test))
I have read some of the e1071 documentation, and it scales the data automatically. I tried scaling my data for LIBSVM, and training then takes almost no time! But the results differ from R's.
If I write:
model <- svm(y ~ x, data = DF, kernel= "linear", cost=2, epsilon=0.5, type="eps-regression",scale=FALSE)
R takes about as long to train as C LIBSVM, and it gives the same result.
So for now, I think my problem lies in how my training and test files are written after scaling.
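A minimal sketch of the scaling step, assuming the goal is to mimic e1071's default behaviour (both x and y standardized to zero mean and unit variance) before writing the LIBSVM training file. The helper names and the toy values are hypothetical. Note that cost and epsilon then apply to the scaled y, and predictions have to be mapped back with the stored mean and standard deviation.

```python
import statistics

def standardize(values):
    """Return (scaled values, mean, sd) using the sample standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)   # n - 1 denominator, like R's sd()
    return [(v - mu) / sd for v in values], mu, sd

def to_libsvm_lines(y, x):
    """Format one-feature data as 'label 1:value' lines."""
    return [f"{yi:.6g} 1:{xi:.6g}" for yi, xi in zip(y, x)]

# Toy stand-in for the first rows of the training file.
x_raw = [1, 2, 3, 4, 5]
y_raw = [5.7367, 5.46076, 4.80722, 4.80722, 4.64745]

x_scaled, x_mu, x_sd = standardize(x_raw)
y_scaled, y_mu, y_sd = standardize(y_raw)
lines = to_libsvm_lines(y_scaled, x_scaled)

# Test inputs must reuse the training mean and sd, not their own.
x_test_scaled = [(v - x_mu) / x_sd for v in [6, 7]]

def unscale_y(pred):
    """Map a prediction on the scaled target back to the price scale."""
    return pred * y_sd + y_mu
```

A common source of mismatched results is scaling the test file with its own statistics instead of the training ones, which is why the sketch keeps x_mu and x_sd around.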

Plot 3d surface map from data frame

I first begin by running the code below to tune a SVM:
tunecontrol <- tune.control(nrepeat=5, sampling = "fix",cross=5, performances=T)
tune_svm1 <- tune(svm,
                  Y ~ 1 + X,
                  data = data,
                  ranges = list(epsilon = seq(epsilon_start, epsilon_end,
                                              (epsilon_end - epsilon_start) / 10),
                                cost = cost_start * (1:5),
                                gamma = seq(gamma_start, gamma_end,
                                            (gamma_end - gamma_start) / 5)),
                  tunecontrol = tunecontrol)
In tune_svm1$performances I have 330 observations containing all the values for epsilon, cost, and gamma that I stated in the ranges section of the above code as well as another column for the calculated error.
I'd like to generate a 3D surface plot for epsilon, cost, gamma, and error, using three variables as X, Y, Z and the last for color. I've read several resources on plot3d and persp but have had a lot of difficulty implementing them.
If I try to follow the examples provided and use mesh to generate a mesh plot, I can only mesh together 3 of the 4 variables from tune_svm1$performances, and saving the separate results for X, Y and Z as shown in the first link is difficult because the mesh is saved as an array, not a matrix. I've tried to hack a graph using the following code, but the visual is nonsensical (probably because the order isn't being preserved by meshing each pair individually):
M1 <- mesh(tune_svm1$performances$epsilon[1:nrow(tune_svm1$performances)]
,tune_svm1$performances$cost[1:nrow(tune_svm1$performances)])
M2 <- mesh(tune_svm1$performances$epsilon[1:nrow(tune_svm1$performances)]
,tune_svm1$performances$gamma[1:nrow(tune_svm1$performances)])
M3 <- mesh(tune_svm1$performances$epsilon[1:nrow(tune_svm1$performances)]
,tune_svm1$performances$error[1:nrow(tune_svm1$performances)])
x <- M1$x ; y <- M1$y ; z <- M2$y ; c <- M3$y
surf3D(x,y,c, colvar = c)
What's the best way to approach this? Thank you.

Uniformly sampling on hyperplanes

Given the vector size N, I want to generate a vector <s1, s2, ..., sN> such that s1+s2+...+sN = S.
It is known that 0 < S < 1 and si < S. The generated vectors should also be uniformly distributed.
Any code in C that helps explain would be great!
The code here seems to do the trick, though it's rather complex.
I would probably settle for a simpler rejection-based algorithm, namely: pick an orthonormal basis in n-dimensional space starting with the hyperplane's normal vector. Transform each of the points (S,0,0,0..0), (0,S,0,0..0) into that basis and store the minimum and maximum along each of the basis vectors. Sample uniformly each component in the new basis, except for the first one (the normal vector), which is always S, then transform back to the original space and check if the constraints are satisfied. If they are not, sample again.
P.S. I think this is more of a maths question, actually, could be a good idea to ask at http://maths.stackexchange.com or http://stats.stackexchange.com
[I'll skip "hyper-" prefix for simplicity]
One possible idea: generate many uniformly distributed points in some enclosing volume and project them onto the target part of the plane.
To get a uniform distribution, the volume must be shaped like the part of the plane but with added margins along the plane normal.
To uniformly generate points in such a volume, we can enclose it in a cube and reject everything outside of the volume.
1. Select a margin M; let's take M = S for simplicity (as long as the margin is positive, it affects only performance).
2. Generate a point in the cube [-M,S+M]x[-M,S+M]x[-M,S+M].
3. If the distance to the plane is more than M, reject the point and go to #2.
4. Project the point onto the plane.
5. Check that the projection falls into [0,S]x[0,S]x[0,S]; if not, reject and go to #2.
6. Add this point to the resulting set and go to #2 if you need more points.
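The steps above can be sketched in Python as follows (a port to C is mechanical). This is a minimal illustration, not a tuned implementation: the function name is hypothetical, the plane is assumed to be s1 + ... + sn = S, and the coordinates are assumed to be non-negative.

```python
import math
import random

def sample_on_plane(n, S, margin=None, rng=random):
    """Draw one point with coordinates in [0, S] that sum to S."""
    M = S if margin is None else margin          # step 1: choose the margin
    while True:
        # Step 2: uniform point in the enclosing cube [-M, S + M]^n
        x = [rng.uniform(-M, S + M) for _ in range(n)]
        excess = sum(x) - S
        # Step 3: distance to the plane sum(x) = S is |excess| / sqrt(n)
        if abs(excess) / math.sqrt(n) > M:
            continue
        # Step 4: orthogonal projection onto the plane
        p = [xi - excess / n for xi in x]
        # Step 5: keep only projections inside [0, S]^n
        if all(0.0 <= pi <= S for pi in p):
            return p                             # step 6: accept the point

random.seed(0)
points = [sample_on_plane(3, 0.5) for _ in range(100)]
```

The acceptance rate drops quickly as n grows, because the cube's volume increases much faster than that of the slab around the target region, so this approach is only practical in low dimensions.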
The problem can be mapped to that of sampling on linear polytopes, for which the common approaches are Monte Carlo methods, random walks, and hit-and-run methods (see https://www.jmlr.org/papers/volume19/18-158/18-158.pdf for examples and a short comparison). It is related to linear programming, and can be extended to manifolds.
There is also the analysis of polytopes in compositional data analysis, e.g. https://link.springer.com/content/pdf/10.1023/A:1023818214614.pdf, which provides an invertible transformation between the plane and the polytope that can be used for sampling.
If you are working in low dimensions, you can also use rejection sampling. This means you first sample on the plane containing the polytope (defined by your inequalities), then discard the samples that violate the inequalities. This latter method is easy to implement (and wasteful, of course); the GNU Octave code below is an example (I leave it to the author of the question to re-implement it in C).
The first requirement is to get a vector orthogonal to the hyperplane. For a sum of N variables this is n = (1,...,1). The second requirement is a point on the plane. For your example that could be p = (S,...,S)/N.
Now any point on the plane satisfies n^T * (x - p) = 0
we assume also that x_i >= 0
With these given, you compute an orthonormal basis of the plane (the null space of the vector n) and then create random combinations in that basis. Finally, you map back to the original space and apply your constraints to the generated samples.
# Example in 3D
dim = 3;
S = 1;
n = ones(dim, 1); # perpendicular vector
p = S * ones(dim, 1) / dim;
# null-space of the perpendicular vector (transposed, i.e. row vector)
# this generates a basis in the plane
V = null (n.');
# These steps are just to reduce the amount of samples that are rejected
# we build a tight bounding box
bb = S * eye(dim); # each column is a corner of the constrained region
# project on the null-space
w_bb = V \ (bb - repmat(p, 1, dim));
wmin = min (w_bb(:));
wmax = max (w_bb(:));
# random combinations and map back
nsamples = 1e3;
w = wmin + (wmax - wmin) * rand(dim - 1, nsamples);
x = V * w + p;
# mask the points inside the polytope
msk = true(1, nsamples);
for i = 1:dim
  msk &= (x(i,:) >= 0);
endfor
x_in = x(:, msk); # inside the polytope (your samples)
x_out = x(:, !msk); # outside the polytope
# plot the results
scatter3 (x(1,:), x(2,:), x(3,:), 8, double(msk), 'filled');
hold on
plot3(bb(1,:), bb(2,:), bb(3,:), 'xr')
axis image
