I am programming a function in Cython which will be called many times. The majority of my code is C written in Cython syntax, with the numerical operations performed using the GNU Scientific Library (GSL).
One thing that has interested me is that there is one particular operation whose numpy speed I cannot match. Please note that the performance gain here is not that important; rather, the reason why numpy appears to be faster is what interests me.
The particular operation I refer to is this (the random numbers are dummy variables; I do not anticipate that they make a difference):
import numpy as np
x = np.array([1.,2.])
Q = np.random.random((1000))
W = np.random.random((1000,2))
b = np.random.random((1000))
Q.dot(np.cos(W.dot(x) + b))
Timing the last line:
%timeit Q.dot(np.cos(W.dot(x) + b))
10000 loops, best of 3: 27 µs per loop
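For reference, the one-liner decomposes into the following stages (y is a temporary introduced here purely for illustration), each of which can be timed separately with %timeit:
y = W.dot(x)        # matrix-vector product: (1000,2) @ (2,) -> (1000,)
y = y + b           # vector addition
y = np.cos(y)       # elementwise cosine (numpy ufunc)
result = Q.dot(y)   # dot product of two length-1000 vectors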
Numpy's speed here does not surprise me particularly. What I am curious about is why my Cython version is almost twice as slow. Assuming variables are similarly initialised outside of the function, I perform the calculation in Cython as follows (edit: full working example at the end of this post; some of the code has been removed here for tidiness):
gsl_blas_dgemv(CblasNoTrans,1,W,x,1,b)
for 0 <= i < n:
    b.data[i*step] = cos(b.data[i*step])
gsl_blas_ddot(Q,b,&result)
Producing (timing that part of the function in isolation):
10000 loops, best of 3: 52.8 µs per loop
Cython is linked against ATLAS (I am doing my testing in a notebook with -L/usr/lib64/atlas -ltatlas as flags after the %%cython cell magic) and compiled with the -O3 optimisation flag.
My questions are as follows:
Is my problem algorithmic? To the best of my knowledge the loop over the dot product between W and x to compute the cosine of the elements is necessary.
Should row or column major ordering of the arrays make a significant difference in this case?
Is numpy that much faster simply due to a more efficient back-end for these kinds of operations, and if so, how does it accomplish this?
Am I naive to think this is a good measure of performance?
Is my Cython version simply poorly coded?
I'm very interested in learning where the speed difference comes from in this case.
Edit 1: Complete example of the Cython code
%%cython -lgsl -L/usr/lib64/atlas -ltatlas -lm
import Cython.Compiler.Options as CO
CO.extra_compile_args = ["-f",'-O3',"-ffast-math","-march=native"]
from libc.string cimport memcpy
from libc.math cimport cos
from cython_gsl cimport *
cimport cython
cdef gsl_vector * gsl_vector_create(double[:] v, int n):
    cdef gsl_vector *u = gsl_vector_alloc(n)
    u.data = &v[0]
    return u

cdef gsl_matrix * gsl_matrix_create(double[:,:] A, int n, int m):
    cdef gsl_matrix * B = gsl_matrix_alloc(n,m)
    B.data = &A[0,0]
    return B
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
def test(double[:] x_, double[:] Q_, double[:,:] W_, double[:] b_, int d, int n):
    cdef:
        gsl_vector * x = gsl_vector_create(x_,d)
        gsl_vector * Q = gsl_vector_create(Q_,n)
        gsl_matrix * W = gsl_matrix_create(W_,n,d)
        gsl_vector * b = gsl_vector_create(b_,n)
        int i
    gsl_blas_dgemv(CblasNoTrans,1,W,x,1,b)
    cdef size_t step = b.stride
    for 0 <= i < n:
        b.data[i*step] = cos(b.data[i*step])
    cdef double result
    gsl_blas_ddot(Q,b,&result)
    return result
Then in python:
x = np.array([1.,2.])
Q = np.random.random((1000))
W = np.random.random((1000,2))
b = np.random.random((1000))
%timeit Q.dot(np.cos(W.dot(x) + b))
%timeit test(x,Q,W,b,2,1000)
Edit 2
For the sake of completeness I have also tried compiling my Cython function against GSL's CBLAS. This is exactly the same as the above, only with the cell magic flags
%%cython -lgsl -lgslcblas -lm
This is almost twice as slow as when compiled against atlas: 10000 loops, best of 3: 109 µs per loop.
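As an aside, numpy reports which BLAS/LAPACK libraries it was built against, which is worth checking when comparing back-ends; np.__config__.show() is a standard numpy utility:
import numpy as np
np.__config__.show()  # prints the BLAS/LAPACK configuration numpy was compiled with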
Related
I'm trying to call np.random.choice, without replacement, row-by-row in a 2-D numpy array. I'm using Cython to get a speed boost. The code is only running a factor of 3 faster than a pure-python implementation, which is not a great result. The bottleneck is the numpy function call itself. When I comment it out, and just supply a static result of, say [3, 2, 1, 0] to each row, I get a factor of 1000 speedup (of course then it's not doing much of anything :)
My question: is there something I'm doing wrong in calling the numpy function that's causing it to go super slow? In theory it's C talking to C, so it should be fast. I looked at the compiled code, and the call to the numpy function looks complex, with statements like __Pyx_GOTREF and __Pyx_PyObject_GetAttrStr that lead me to believe it's using pure python in the process (bad!!).
My code:
# tag: numpy
import numpy as np
# compile-time info for numpy
cimport numpy as np
np.import_array()
# array dtypes
W_DTYPE = np.float
C_DTYPE = np.int
cdef int NUM_SELECTIONS = 4 # FIXME should be function kwarg
#compile-time dtypes
ctypedef np.float_t W_DTYPE_t
ctypedef np.int_t C_DTYPE_t
def allocate_choices(np.ndarray[W_DTYPE_t, ndim=2] round_weights,
                     np.ndarray[C_DTYPE_t, ndim=1] choice_labels):
    """
    For ea. row in `round_weights` select NUM_SELECTIONS=4 items among
    corresponding `choice_labels`, without replacement, with corresponding
    probabilities in `round_weights`.

    Args:
        round_weights (np.ndarray): 2-d array of weights, w/
            size [n_rounds, n_choices]
        choice_labels (np.ndarray): 1-d array of choice labels,
            w/ size [n_choices]; choices must be *INTEGERS*

    Returns:
        choices (np.ndarray): selected items per round, w/ size
            [n_rounds, NUM_SELECTIONS]
    """
    assert round_weights.dtype == W_DTYPE
    assert choice_labels.dtype == C_DTYPE
    assert round_weights.shape[1] == choice_labels.shape[0]

    # initialize final choices array
    cdef int n_rows = round_weights.shape[0]
    cdef np.ndarray[C_DTYPE_t, ndim=2] choices = np.zeros([n_rows, NUM_SELECTIONS],
                                                          dtype=C_DTYPE)
    # Allocate choices, per round
    cdef int i, j
    cdef bint replace = False
    for i in range(n_rows):
        choices[i] = np.random.choice(choice_labels,
                                      NUM_SELECTIONS,
                                      replace,
                                      round_weights[i])
    return choices
Update on this, after chatting with some folks and examining the compiled code. @DavidW's comment above put it well:
" In theory it's C talking to C, so it should be fast" - no. Not true.
The main bit of Numpy that cimport numpy gives direct access to is
just faster indexing of arrays. Numpy functions are called using the
normal Python mechanism. They may ultimately be implemented in C, but
that doesn't give a shortcut from Cython's point-of-view.
So the issue here is: calling this numpy function requires translating the inputs back into Python objects, passing them in, and then letting numpy do its thing. I don't think this is the case for all numpy functions (from timing experiments, some of the ones I call work quite fast), but plenty are not "Cythonized".
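A minimal sketch (not from my original experiments) that shows the effect: wrap a numpy call in a trivial Cython function and time it against the same call made directly from Python. The wrapper buys nothing, because the call still goes through the normal Python attribute lookup and call mechanism.
%%cython
import numpy as np

def np_sum_from_cython(double[:] x):
    # np.sum is looked up as a Python attribute and invoked as a
    # Python function here, exactly as from interpreted code
    return np.sum(np.asarray(x))
Timing np_sum_from_cython(arr) against a plain np.sum(arr) gives essentially identical results for small arrays.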
I need to use a C library that gives me a function that takes as input a callback function. This callback function in turn takes an array and returns a value. So for example
double candidate(double *x);
would be a valid callback.
I want to use Cython to implement a callback function, using Numpy to simplify the implementation.
So I am trying to implement a function
cdef double cythonCandidate(double *x):
and now I would like to "cast" x as a numpy array immediately and then do operations using numpy.
For example, I might want to write something like:
cdef double euclideanNorm(double *x):
    # cast x into a numpy array nx here - don't know how!!
    return np.sum(nx * nx)
Q1. How do I do this? How do I cast a C array into a numpy array without explicit copying, but just referencing the underlying buffer?
Q2: Is there python overhead in using numpy like I intend to?
For Q1:
%%cython -f
import numpy as np

def test_cast():
    cdef double *x = [1, 2, 3, 4, 5]
    # cast to a memoryview that refers to the underlying buffer without copying
    cdef double[::1] x_view = <double[:5]>x
    # the numpy array also refers to the underlying buffer without copying
    xarr = np.asarray(x_view)
    x_view[0] = 100
    xarr[1] = 200
    x[2] = 300
    print(xarr.flags)  # the OWNDATA flag should be False
    return x[0], x[1], x[2], x[3], x[4]  # (100.0, 200.0, 300.0, 4.0, 5.0)
Note: if you don't declare x_view and instead write xarr = np.asarray(<double[:5]>x) directly, the Cython compiler may crash with the error message AttributeError: 'CythonScope' object has no attribute 'viewscope'. This can be fixed with from cython cimport view, for example:
%%cython -f
from cython cimport view  # comment this line to see what will happen
import numpy as np

def test_error_cast():
    cdef double *x = [1, 2, 3, 4, 5]
    xarr = np.asarray(<double[:5]>x)
    xarr[0] = 200
    return x[0], x[1], x[2], x[3], x[4]
I don't know whether it's a feature or bug.
For Q2:
The numpy overhead should be significant when the array is small. See the benchmark below.
%%cython -a
from cython cimport view
import numpy as np

cdef inline double euclideanNorm(double *x, size_t x_size):
    xarr = np.asarray(<double[:x_size]>x)
    return np.sum(xarr * xarr)

cdef inline double euclideanNorm_c(double *x, size_t x_size):
    cdef double ss = 0.0
    cdef size_t i
    for i in range(x_size):
        ss += x[i] * x[i]
    return ss

def c_norm(double[::1] x):
    return euclideanNorm_c(&x[0], x.shape[0])

def np_norm(double[::1] x):
    return euclideanNorm(&x[0], x.shape[0])
Small array in my PC:
import numpy as np
small_arr = np.random.rand(100)
print(c_norm(small_arr))
print(np_norm(small_arr))
%timeit c_norm(small_arr) # 1000000 loops, best of 3: 864 ns per loop
%timeit np_norm(small_arr) # 100000 loops, best of 3: 8.51 µs per loop
Big array in my PC:
big_arr = np.random.rand(1000000)
print(c_norm(big_arr))
print(np_norm(big_arr))
%timeit c_norm(big_arr) # 1000 loops, best of 3: 1.46 ms per loop
%timeit np_norm(big_arr) # 100 loops, best of 3: 4.93 ms per loop
I am working in C, using the GNU Scientific Library (GSL). Essentially, I need to do the equivalent of the following MATLAB code:
x=x.*(A*x);
where x is a gsl_vector, and A is a gsl_matrix.
I managed to do (A*x) with the following command:
gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 1.0, res);
where res is an another gsl_vector, which stores the result. If the matrix A has size m * m, and vector x has size m * 1, then vector res will have size m * 1.
Now, what remains to be done is the elementwise product of vectors x and res (the results should be a vector). Unfortunately, I am stuck on this and cannot find the function which does that.
If anyone can help me with that, I would be very grateful. In addition, does anyone know of better GSL documentation than https://www.gnu.org/software/gsl/manual/html_node/GSL-BLAS-Interface.html#GSL-BLAS-Interface, which so far is confusing me?
Finally, would I lose in time performance if I do this step by simply using a for loop (the size of the vector is around 11000 and this step will be repeated 500-5000 times)?
for (i = 0; i < m; i++)
    gsl_vector_set(res, i, gsl_vector_get(x, i) * gsl_vector_get(res, i));
Thanks!
The function you want is:
gsl_vector_mul(res, x)
I have used Intel's MKL, and I like the documentation on their website for these BLAS routines.
The for-loop is fine if GSL is well designed; for example, gsl_vector_set() and gsl_vector_get() can be inlined. You could compare the running time with gsl_blas_daxpy: the for-loop is well optimised if the timing results are similar.
On the other hand, you may want to try the much better matrix library Eigen, with which you can implement your operation with code similar to this:
x.array() = x.array() * (A * x).array();
Here is the Cython code I am trying to optimize:
import cython
cimport cython
from libc.stdlib cimport rand, srand, RAND_MAX
import numpy as np
cimport numpy as np
def genLoans(int loanid):
    cdef int i, j, k
    cdef double[:,:,:] loans = np.zeros((240, 20, 1000))
    cdef double[:,:] aggloan = np.zeros((240, 20))
    for j from 0 <= j < 1000:
        srand(loanid*1000 + j)
        for i from 0 <= i < 240:
            for k from 0 <= k < 20:
                loans[i,k,j] = rand()
                ### some other logic
                aggloan[i,k] += loans[i,k,j]/1000
    return aggloan
cython -a shows the numpy allocation lines highlighted in yellow.
I guess that when I initialize the zero arrays loans and aggloan, numpy slows me down. Yet I need to run 5000+ loans. Just wondering if there are other ways to avoid using numpy when I define 3-d/2-d and return arrays...
The yellow part is because of the numpy call, where you allocate the array. What you can do is pass these arrays as arguments to the function, and reuse them from one call to the next.
Also, I see you are rewriting all the elements, so you are claiming memory, filling it with zeroes, and then overwriting with your numbers. If you are sure you will overwrite every element, you can use np.empty, which does not initialize the memory.
Note: the Linux kernel has a specific way of allocating memory initialised to 0, which is faster than for any other value, and modern numpy can use it, but it is still slower than empty:
In [4]: %timeit np.zeros((100,100))
100000 loops, best of 3: 4.04 µs per loop
In [5]: %timeit np.ones((100,100))
100000 loops, best of 3: 8.99 µs per loop
In [6]: %timeit np.empty((100,100))
1000000 loops, best of 3: 917 ns per loop
Last but not least, are you sure this is your bottleneck? I don't know what processing you are doing, but yellow reflects how much C code a line generates, not how much time it takes. Anyway, from the timings, using empty should speed that step up by a factor of four. If you want more, post the rest of your code at Code Review.
Edit:
Expanding on my second sentence: your function signature can be
def genLoans(int loanid, double[:,:,:] loans, double[:,:] aggloan):
You initialize the arrays before your loop, and just pass them again and again.
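A minimal sketch of that calling pattern (assuming aggloan must be zeroed between calls, since the function accumulates into it):
import numpy as np

loans = np.empty((240, 20, 1000))   # scratch space, reused across calls
aggloan = np.zeros((240, 20))
for loanid in range(5000):
    aggloan[:] = 0                  # reset the accumulator for this loan
    result = genLoans(loanid, loans, aggloan)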
In any case, on my machine (Linux, Intel i5), it takes 9 µs, so you are spending a total of 45 ms. This is definitely not your bottleneck. Profile!
Imagine for instance we have the following functions:
f = @(n) sin((0:1e-3:1) .* n * pi);
g = @(n, t) cos(n .^ 2 * pi ^2 / 2 .* t);
h = @(n) f(n) * g(n, 0);
Now, I would like to be able to enter an array of values for n into h and return a sum of the results for each value of n.
I am trying to be efficient, so I am avoiding the novice for-loop method of just filling out a pre-allocated matrix and summing down the columns. I also tried using arrayfun and converting the cell to a matrix then summing that, but it ended up being a slower process than the for-loop.
Does anyone know how I might do this?
The fact is, the "novice" for-loop is going to be about as fast as any other vectorized solution, thanks to improvements in JIT compilation in recent versions of MATLAB.
% array of values of n
len = 500;
n = rand(len,1);

% preallocate matrix
X = zeros(len,1001);

% fill rows
for i=1:len
    X(i,:) = h(n(i));   % call function handle
end
out = sum(X,1);
The above is as fast as (maybe even faster):
XX = cell2mat(arrayfun(h, n, 'UniformOutput',false));
out = sum(XX,1);
EDIT:
Here it is computed directly without function handles in a single vectorized call:
n = rand(len,1);
t = 0; % or any other value
out = sum(bsxfun(@times, ...
    sin(bsxfun(@times, n, (0:1e-3:1)*pi)), ...
    cos(n.^2 * t * pi^2/2)), 1);