I'm trying to quickly solve linear systems (of the form x %*% res = y) for a large array in R.
I have the data x and y and want to compute res.
What is the best, i.e. fastest, way to do this? Thanks a lot!
Here is an example and some approaches (it seems like "solve" is the fastest?):
# setup:
p = 20 # dimension of matrix to solve
nmkt= 3000 # length of array, i.e., number of equations to solve
res = matrix(0,p,nmkt) # result matrix
x = array(rnorm(p*p*nmkt),c(p,p,nmkt)) # data
# make x symmetric and positive definite (hence invertible)
for(i in 1:nmkt){ x[, , i]= crossprod(x[, , i])+diag(p)*0.01}
y = matrix(rnorm(p*nmkt),nmkt,p) # data
# computation and test:
R = 100 # number of replications (in my real application R is much larger, 1e5 or even 1e7)
system.time(for(r in 1:R){ for(i in 1:nmkt){res[,i] = qr.solve(x[, , i], y[i,], tol = 1e-7)}})
system.time(for(r in 1:R){ for(i in 1:nmkt){res[,i] = solve(x[, , i], y[i,], tol = 1e-7)}})
system.time(for(r in 1:R){ for(i in 1:nmkt){res[,i] = crossprod( chol2inv(chol( x[, , i] )) , y[i,] )}})
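Another possible variant (a sketch, not benchmarked here): reuse the Cholesky factor with two triangular solves instead of forming the explicit inverse via chol2inv:
system.time(for(r in 1:R){ for(i in 1:nmkt){
  U = chol(x[, , i])                 # x[, , i] = t(U) %*% U
  res[,i] = backsolve(U, backsolve(U, y[i,], transpose = TRUE))
}})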
Is looping through the array a good solution?
Or should I use a sparse matrix, as below?
require(Matrix)
j = c(matrix(1:(p*nmkt),p,p*nmkt,byrow=TRUE))
i = c(aperm( array(j,c(p,p,nmkt)), c(2,1,3)))
system.time(for(r in 1:R){ res= solve(sparseMatrix(i=i, j=j, x = c(x)), c(t(y)), tol = 1e-7)} )
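For completeness, the explicit loop can also be written with vapply; I would not necessarily expect a big speed difference, since the per-system work is the same (a sketch, not benchmarked here):
system.time(for(r in 1:R){
  res = vapply(seq_len(nmkt), function(i) solve(x[, , i], y[i,], tol = 1e-7), numeric(p))
})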
I am trying to speed up my code using Cython. After translating the code from Python to Cython I am not seeing any speed-up. I think the root of the problem is the poor performance I am getting when using NumPy arrays in Cython.
I have come up with a very simple program to show this:
############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython

def func1(long N):
    cdef double sum1,sum2,sum3
    cdef long i
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0
    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i
    return sum1,sum2,sum3

def func2(long N):
    cdef np.ndarray[np.float64_t,ndim=1] sum_arr
    cdef long i
    sum_arr = np.zeros(3,dtype=np.float64)
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr

def func3(long N):
    cdef double sum_arr[3]
    cdef long i
    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr
##########################################
##########################################
################## test.py ###############
import time
import test as test

N = 1000000000

for i in xrange(10):
    start = time.time()
    sum1,sum2,sum3 = test.func1(N)
    print 'Time taken = %.3f'%(time.time()-start)
print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func2(N)
    print 'Time taken = %.3f'%(time.time()-start)
print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func3(N)
    print 'Time taken = %.3f'%(time.time()-start)
############################################
And from python test.py I get the following (the three blocks of timings correspond to func1, func2 and func3, in that order):
Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418
Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595
Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418
My question is: why is func2 almost 2x slower than func1 and func3?
Is there a way to improve this?
Thanks!
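In case anyone wants to reproduce the timings, a minimal setup.py for building test.pyx could look like the sketch below (any standard Cython build should do; this is just one way):
# setup.py (sketch)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

setup(ext_modules=cythonize([
    Extension("test", ["test.pyx"], include_dirs=[numpy.get_include()])
]))
# build with: python setup.py build_ext --inplace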
######## UPDATE
My real problem is as follows. I am calling a function that accepts a 3D array (say P[i,j,k]). The function loops through each element and computes several quantities: a quantity that depends on the value of the array at that position (say A = f(P[i,j,k])) and other quantities that only depend on the position itself (B = g(i,j,k)). Schematically things look like this:
for i in xrange(N):
    corr1 = h(i,val)
    for j in xrange(N):
        corr2 = h(j,val)
        for k in xrange(N):
            corr3 = h(k,val)
            A = f(P[i,j,k])
            B = g(i,j,k)
            Arr[B] += A*corr1*corr2*corr3
where val is a property of the 3D array, represented by a number. This number can be different for different fields.
Since I have to do this operation over many 3D arrays, I thought it would be better to create a new routine that accepts many different input 3D arrays, leaving the number of arrays unknown a priori. The idea is that since B will be exactly the same over all arrays, I can avoid computing it for each array and only compute it once. The problem is that corr1, corr2, corr3 above then become arrays:
If I have a number of 3D arrays equal to num_3D_arrays, I am doing something like:
for i in xrange(N):
    for p in xrange(num_3D_arrays):
        corr1[p] = h(i,val[p])
    for j in xrange(N):
        for p in xrange(num_3D_arrays):
            corr2[p] = h(j,val[p])
        for k in xrange(N):
            for p in xrange(num_3D_arrays):
                corr3[p] = h(k,val[p])
            B = g(i,j,k)
            for p in xrange(num_3D_arrays):
                A[p] = f(P[i,j,k])
                Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]
So the fact that I am changing the variables corr1, corr2, corr3 and A from scalars to arrays is killing the performance I expected to gain by avoiding the big loop.
There are a couple of things you can do to speed up array indexing in Cython:
Turn off bounds checking and wraparound.
Use typed memoryviews.
Declare the array as contiguous.
So for your function:
@cython.boundscheck(False)
@cython.wraparound(False)
def func2(long N):
    cdef np.float64_t[::1] sum_arr
    cdef long i
    sum_arr = np.zeros(3,dtype=np.float64)
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr
For the original code Cython produced the following C code for the line sum_arr[0] += i:
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
__pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
With the improvements above:
__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;
why func2 is almost 2x slower than func1?
It's because indexing causes an indirection, so you double the number of elementary operations. Calculate the sums like in func1, then assign with
sum=array([sum1,sum2,sum3])
How to speed up Python code?
NumPy is the first good idea; it reaches nearly C speed with no effort.
Numba can fill the gap with no effort too, and is very simple.
Cython is for the critical cases.
Here is some illustration of that:
# python way
def func1(N):
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0
    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i
    return sum1,sum2,sum3

# numpy way
from numpy import arange
def func2(N):
    aran = arange(float(N))
    sum1 = aran.sum()
    sum2 = (2.0*aran).sum()
    sum3 = (3.0*aran).sum()
    return sum1,sum2,sum3

# numba way
import numba
func3 = numba.njit(func1)
"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop
In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop
In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""
Look at the HTML produced by cython -a ...pyx.
For func1, the sum1 += i line expands to:
+15: sum1 += i
__pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);
For func3, with a C array:
+45: sum_arr[0] += i
__pyx_t_3 = 0;
(__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);
Slightly more complicated, but straightforward C.
But for func2:
+29: sum_arr[0] += i
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
__pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
__PYX_ERR(0, 29, __pyx_L1_error)
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
Much more complicated, with references to numpy buffer helpers (e.g. __Pyx_BufPtrStrided1d). Even initializing the array is complicated:
+26: sum_arr = np.zeros(3,dtype=np.float64)
__pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
....
I expect that moving the sum_arr creation to the calling Python, and passing it as an argument to func2 would save some time.
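A sketch of that idea (not timed here; it assumes the same cimports as test.pyx, and the name func2_inplace is only for illustration):
@cython.boundscheck(False)
@cython.wraparound(False)
def func2_inplace(long N, np.float64_t[::1] sum_arr):
    # sum_arr is allocated once by the caller and filled in place
    cdef long i
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr
# caller side:
# out = np.zeros(3, dtype=np.float64)
# test.func2_inplace(N, out)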
Have you read this guide for using memoryviews:
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
You'll get the best Cython performance if you focus on writing the low-level operations so they translate into simple C. In
for k in xrange(N):
    corr3 = h(k,val)
    A = f(P[i,j,k])
    B = g(i,j,k)
    Arr[B] += A*corr1*corr2*corr3
It's not the loops on i,j,k that will slow you down. It's evaluating h, f, and g each time, as well as the Arr[B] +=.... Those functions should be tightly coded Cython, not general Python functions. Look at the compiled simplicity of the sum3d function in the memoryview guide.
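For instance, h could be declared as a cdef function so the inner loops compile down to plain C calls (a sketch; the body below is only a placeholder, not your real h):
cdef inline double h(long k, double val):
    # placeholder formula, stands in for the real weighting function
    return 1.0 + val * k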
I'm writing code to solve Ax = b using MATLAB's x = A\b. I believe my problem lies in getting the data from the files into the arrays. Right now, the solution vector comes out as all zeros.
The matrices I'm using each have 10 rows. They are aligned correctly in the text files.
% solve a linear system Ax = b by reading A and b from input file
% and then writing x on output file.
clear;
clc;
input_filename = 'my_input.txt';
output_filename = 'my_output.txt';
% read data from file
fileID = fopen('a_matrix.txt', 'r');
formatSpec = '%d %f';
sizeA = [10 Inf];
A = load('b_matrix.txt');
A = A'
file2ID = fopen('b_matrix.txt','r');
formatSpec2 = '%d %f';
sizeB = [10 Inf];
b = load('b_matrix.txt');
fclose(file2ID);
b = b'
% solve the linear system
x = A\b;
% write output data on file
dlmwrite('my_output.txt',x,'delimiter',',','precision',4);
% print screen
fprintf('Solution vector is: \n');
fprintf('%4.2f \n', x);
I answered my own question but I felt the need to share in case anyone else has similar troubles.
% solve a linear system Ax = b by reading A and b from input file
% and then writing x on output file.
clear;
clc;
input_filename = 'my_input.txt';
output_filename = 'my_output.txt';
% read data from file
f = textread('a_matrix.txt', '%f');
vals = reshape(f, 11, []).';
A = vals(:,1:10);
b = vals(:,11);
% solve the linear system
x = A\b;
% write output data on file
dlmwrite('my_output.txt',x,'delimiter',',','precision',4);
% print screen
fprintf('Solution vector is: \n');
fprintf('%4.2f \n', x);
I ended up combining the 'A' matrix and 'b' vector into a single text file for simplicity. MATLAB reads data in by columns, so it is necessary to use reshape in order to fit the data into the array correctly. Then I split the information out of the single matrix by columns, using the vals matrix as seen in my code. The 'A' matrix is all numbers in columns 1 through 10, while 'b' is the 11th (and final) column.
Using MATLAB's x = A\b operator, I was able to solve the linear system of equations.
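As a small illustration of the column-major point (a made-up 2-row example, not my actual data): if a file holds the rows "1 2 3" and "4 5 6", textread returns the column vector [1;2;3;4;5;6], and
reshape([1;2;3;4;5;6], 3, []).'   % gives [1 2 3; 4 5 6], i.e. the original rows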
I have a problem where I can't add values to my 1 x 250 matrix directly from a variable. This is the code.
COMPORT = 'COM4';
BAUDRATE = 115200;
s1 = serial(COMPORT, 'baudrate', BAUDRATE);
set(s1, 'Terminator', 10);
fopen(s1);
adc = 0;
N = 250;
values = zeros(1, N);
for n = 1:N
    adc = fscanf(s1);
    values(n) = adc;
    flushinput(s1);
    flushoutput(s1);
end
x = linspace(0, 250);
plot(x, n);
The values(n) = adc line does not seem to work, and I don't know how to work around it.
This doesn't work because values(n) is a single element and the output from fscanf(s1) consists of several elements.
Maybe you want to use cells?
values{n} = adc;
Substitute the pre-allocation values = zeros(1, N) with values = cell(1, N);.
Notice that you need to make some changes later in your code. I'll leave that up to you.
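A sketch of that change (assuming each fscanf(s1) call returns one reading as a line of text):
values = cell(1, N);
for n = 1:N
    values{n} = fscanf(s1);   % store the raw line; convert with str2double later if needed
    flushinput(s1);
    flushoutput(s1);
end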
I am trying to write a MATLAB function to solve a test integral using the Metropolis method. My function is listed below.
The integral is of x*exp(-x^2) from 0 to infinity, divided by the integral of exp(-x^2) from 0 to infinity.
This function converges to ~0.5 (notably, it does fluctuate about this answer a little), but analytically the solution is ~0.5642, i.e. 1/sqrt(pi).
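For reference, the analytic value follows from two standard Gaussian integrals:
$$\int_0^\infty x\,e^{-x^2}\,dx = \tfrac{1}{2}, \qquad \int_0^\infty e^{-x^2}\,dx = \tfrac{\sqrt{\pi}}{2}, \qquad \text{so the ratio is } \frac{1/2}{\sqrt{\pi}/2} = \frac{1}{\sqrt{\pi}} \approx 0.5642.$$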
The code I use to run the function is also below.
What have I done wrong? How do I use the Metropolis method to correctly solve this test integral?
% Metropolis Method for Integration
% Written by John Furness - Computational Physics, KTH
function [I S1 S2] = metropolis(f,a,b,n,sig)
% This function calculates an integral using the Metropolis method.
% It only takes as input a function f on an interval between a and b,
% where n is the number of points.

% Defining burn-in
%burnin = n/20;
burnin = 0;

% Finding maximum point
x = linspace(a,b,1000);
f1 = f(x);
max1 = max(f1);

% Setting up x-vector and mu
x(1) = rand(1);
mu = 0;

% Generating random points for x with a Gaussian distribution.
% The proposal distribution will be the normal distribution.
strg = 'exp(-1*((x-mu)/sig).^2)';
norm = inline(strg,'x','mu','sig');

for i = 2:n
    % This loop generates a new state from the proposal distribution.
    y = x(i-1) + sig*randn(1);
    % Generate a uniform random number for comparison
    u = rand(1);
    % Alpha is the acceptance probability
    alpha = min([1, (f(y))/((f(x(i-1))))]);
    if u <= alpha
        x(i) = y;
    else
        x(i) = x(i-1);
    end
end

% Discarding burn-in
%x(1:burnin) = [];

%I = ((inside)/length(x))*max1*(b-a);
I = (1/length(f(x)))*((sum(f(x))))/sum(norm(x,mu,sig));

% My investigation variables to see what's happening
%S1 = sum(f(x));
%S2 = sum(norm1(x,mu,sig));
S1 = min(x);
S2 = max(x);
end
Code used to run the above function:
% Code for Running Metropolis Method
% Written by John Furness - Computational Physics
% Clearing Workspace
clear all
close all
clc
% Equation 1
% Changing Parameters for Equation 1
a1 = 0;
b1 = 10;
n1 = 10000;
sig = 2;
N1 = @(x)(x.*exp(-x.^2));
D1 = @(x)(exp(-x.^2));
denom = metropolis(D1,a1,b1,n1,sig);
numer = metropolis(N1,a1,b1,n1,sig);
solI1 = numer/denom
What I want to achieve when calling divide([1,2], 3, X). is something like the output below.
I should just get all the permutations of the first list, divided over N lists:
X = [[],[],[1,2]] ;
X = [[],[],[2,1]] ;
X = [[],[2],[1]] ;
X = [[],[1],[2]] ;
X = [[],[1,2],[]] ;
X = [[],[2,1],[]] ;
X = [[],[],[2,1]] ;
X = [[],[],[1,2]] ;
X = [[],[1],[2]] ;
X = [[],[2],[1]] ;
X = [[],[2,1],[]] ;
X = [[],[1,2],[]] ;
X = [[2],[],[1]] ;
X = [[2],[1],[]] ;
X = [[1],[],[2]] ;
X = [[1],[2],[]] ;
X = [[1,2],[],[]] ;
X = [[2,1],[],[]] ;
but for some reason, if my list is longer than 2 items, the code below goes into a loop and shows way too much information.
% Divides a list over N sets
divide(_,N,[]) :- N < 1.
divide(Items,1,[Items]).
divide(Items,N,[Selected|Other]) :- N > 1,
sublistPerm(Items,Selected,Rest),
N1 is N-1,
divide(Rest,N1,Other).
The sublistPerm predicate works as it should (you can test it if you want).
% Gets all power sets of a list and permutes them
sublistPerm(Items, Sel, Rest) :- sublist(Items, Temp1, Temp2),
permutation(Temp1, Sel),
permutation(Temp2, Rest).
% Gets all power sets of a list
sublist([], [], []).
sublist([X|XS], YS, [X|ZS]) :- sublist(XS, YS, ZS).
sublist([X|XS], [X|YS], ZS) :- sublist(XS, YS, ZS).
If you make the effort of running the following query, you will see the redundant info that I am getting. I have ABSOLUTELY no idea why it doesn't just terminate, as it should: divide([1,2,3], 3, X).
As you can see in my example, there are no duplicates. Normally they won't occur, and if they do, duplicates should be removed.
Thanks to anyone pointing me in the right direction.
There are several issues with your code, but looping is not one of them. We can set that issue aside very quickly:
?- divide([1,2], 3, X), false.
This terminates. No termination issues with this query.
There are some redundant solutions. But again this is not really an issue. However, what is most problematic is that your relation is incomplete. The minimal example is:
?- divide([1,2], 1, [[2,1]]).
which should succeed but fails. So let's attack this issue first. The fact
divide(Items,1,[Items]).
has to be generalized to cover all permutations.
divide(Items,1,[ItemsP]) :-
permutation(Items, ItemsP).
For the redundant answers/solutions, the second permutation/2 goal is not needed; you can replace it by (=)/2 or rewrite your program accordingly.
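A sketch of the program with both changes applied (not exhaustively tested here):
% generalized base case covering all permutations
divide(_, N, []) :- N < 1.
divide(Items, 1, [ItemsP]) :-
    permutation(Items, ItemsP).
divide(Items, N, [Selected|Other]) :-
    N > 1,
    sublistPerm(Items, Selected, Rest),
    N1 is N - 1,
    divide(Rest, N1, Other).

% sublistPerm with the second permutation/2 goal dropped (Rest taken as-is)
sublistPerm(Items, Sel, Rest) :-
    sublist(Items, Temp1, Rest),
    permutation(Temp1, Sel).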