Here is the Cython code I am trying to optimize:
import cython
cimport cython
from libc.stdlib cimport rand, srand, RAND_MAX
import numpy as np
cimport numpy as np
def genLoans(int loanid):
    cdef int i, j, k
    cdef double[:,:,:] loans = np.zeros((240, 20, 1000))
    cdef double[:,:] aggloan = np.zeros((240, 20))
    for j from 0 <= j < 1000:
        srand(loanid*1000+j)
        for i from 0 <= i < 240:
            for k from 0 <= k < 20:
                loans[i,k,j] = rand()
                ###some other logics
                aggloan[i,k] += loans[i,k,j]/1000
    return aggloan
cython -a shows the allocation lines highlighted in yellow.
I guess that when I initialize the zero arrays loans and aggloan, NumPy slows me down. Yet I need to run 5000+ loans. I'm just wondering if there are other ways to avoid using NumPy when I define the 3D/2D and return arrays...
The yellow part is because of the NumPy call, where you allocate the array. What you can do is pass these arrays as arguments to the function, and reuse them from one call to the next.
Also, I see you are rewriting all the elements, so you are claiming memory, filling it with zeroes, and then putting in your numbers. If you are sure you are overwriting every element, you can use np.empty, which will not initialize the values.
Note: the Linux kernel has a specific way of allocating memory initialised to 0 that is faster than for any other value, and modern NumPy can use it, but it is still slower than empty:
In [4]: %timeit np.zeros((100,100))
100000 loops, best of 3: 4.04 µs per loop
In [5]: %timeit np.ones((100,100))
100000 loops, best of 3: 8.99 µs per loop
In [6]: %timeit np.empty((100,100))
1000000 loops, best of 3: 917 ns per loop
Last but not least, are you sure this is your bottleneck? I don't know what processing you are doing, but the yellow indicates the number of lines of generated C code, not time. Anyway, from the timings, using empty should speed that part up by a factor of four. If you want more, post the rest of your code at Code Review.
Edit:
Expanding on my second sentence: your function signature can be
def genLoans(int loanid, double[:,:,:] loans, double[:,:] aggloan):
You initialize the arrays before your loop, and just pass them again and again.
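A minimal sketch of how the calling side could then look (the driver name run_all_loans and the result handling are placeholders, not part of the original code):

# Hypothetical driver: allocate the work buffers once and reuse them.
# Every element of `loans` is overwritten inside genLoans, so np.empty is
# fine there; `aggloan` is accumulated with +=, so it is reset each pass.
import numpy as np

def run_all_loans(n_loans=5000):
    loans = np.empty((240, 20, 1000))
    aggloan = np.empty((240, 20))
    results = []
    for loanid in range(n_loans):
        aggloan[:] = 0.0
        genLoans(loanid, loans, aggloan)   # fills aggloan in place
        results.append(aggloan.copy())     # copy, because the buffer is reused
    return results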
In any case, on my machine (Linux, Intel i5) it takes 9 µs per call, so over 5000 loans you are spending a total of 45 ms. This is definitely not your bottleneck. Profile!
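For example, a quick way to check where the time actually goes (run_all_loans is the hypothetical driver from the sketch above):

%prun run_all_loans(5000)

# or, outside IPython, with the standard-library profiler:
import cProfile
cProfile.run("run_all_loans(5000)", sort="cumtime")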
I'm trying to call np.random.choice, without replacement, row-by-row in a 2-D numpy array. I'm using Cython to get a speed boost. The code is only running a factor of 3 faster than a pure-python implementation, which is not a great result. The bottleneck is the numpy function call itself. When I comment it out, and just supply a static result of, say [3, 2, 1, 0] to each row, I get a factor of 1000 speedup (of course then it's not doing much of anything :)
My question: is there something I'm doing wrong in calling the numpy function that's causing it to go super slow? In theory it's C talking to C, so it should be fast. I looked at the compiled code, and the call to the numpy function looks complex, with statements like __Pyx_GOTREF and __Pyx_PyObject_GetAttrStr that lead me to believe it's using pure python in the process (bad!!).
My code:
# tag: numpy
import numpy as np
# compile-time info for numpy
cimport numpy as np
np.import_array()

# array dtypes
W_DTYPE = np.float
C_DTYPE = np.int

cdef int NUM_SELECTIONS = 4  # FIXME should be function kwarg

# compile-time dtypes
ctypedef np.float_t W_DTYPE_t
ctypedef np.int_t C_DTYPE_t

def allocate_choices(np.ndarray[W_DTYPE_t, ndim=2] round_weights,
                     np.ndarray[C_DTYPE_t, ndim=1] choice_labels):
    """
    For ea. row in `round_weights` select NUM_SELECTIONS=4 items among
    corresponding `choice_labels`, without replacement, with corresponding
    probabilities in `round_weights`.

    Args:
        round_weights (np.ndarray): 2-d array of weights, w/
            size [n_rounds, n_choices]
        choice_labels (np.ndarray): 1-d array of choice labels,
            w/ size [n_choices]; choices must be *INTEGERS*

    Returns:
        choices (np.ndarray): selected items per round, w/ size
            [n_rounds, NUM_SELECTIONS]
    """
    assert round_weights.dtype == W_DTYPE
    assert choice_labels.dtype == C_DTYPE
    assert round_weights.shape[1] == choice_labels.shape[0]

    # initialize final choices array
    cdef int n_rows = round_weights.shape[0]
    cdef np.ndarray[C_DTYPE_t, ndim=2] choices = np.zeros([n_rows, NUM_SELECTIONS],
                                                          dtype=C_DTYPE)
    # Allocate choices, per round
    cdef int i, j
    cdef bint replace = False
    for i in range(n_rows):
        choices[i] = np.random.choice(choice_labels,
                                      NUM_SELECTIONS,
                                      replace,
                                      round_weights[i])
    return choices
Update on this, after chatting with some folks and examining the compiled code. @DavidW's comment above put it well:
" In theory it's C talking to C, so it should be fast" - no. Not true.
The main bit of Numpy that cimport numpy gives direct access to is
just faster indexing of arrays. Numpy functions are called using the
normal Python mechanism. They may ultimately be implemented in C, but
that doesn't give a shortcut from Cython's point-of-view.
So the issue here is that calling this NumPy function requires translating the inputs back into Python objects, passing them in, and then letting NumPy do its thing. I don't think this is the case for all NumPy functions (from timing experiments, some of the ones I call work quite fast), but plenty are not "Cythonized".
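A toy sketch of that point (not the asker's function; it just contrasts typed-memoryview indexing, which compiles down to plain C, with a NumPy call made inside the same kind of loop):

%%cython
import numpy as np

def indexed_sum(double[::1] x):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        s += x[i]            # typed memoryview indexing: stays at the C level
    return s

def numpy_call_sum(double[::1] x):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        s += np.abs(x[i])    # dispatched through the normal Python call mechanism on every iteration
    return s

Annotating this with cython -a should show far more Python interaction (yellow) on the np.abs line than on the plain indexing.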
I am programming a function in Cython which will be called many times. The majority of my code is "Cython" syntax C with the numerical operations performed using the GNU Scientific Library (GSL).
One thing that has interested me is that there is one particular operation on which I am unable to match numpy's speed. Please note that the performance gain here is not that important; rather, it is the reason why numpy appears to be faster that interests me.
The particular operation I refer to is this (the random numbers are dummy variables - I do not anticipate they make a difference):
import numpy as np
x = np.array([1.,2.])
Q = np.random.random((1000))
W = np.random.random((1000,2))
b = np.random.random((1000))
Q.dot(np.cos(W.dot(x) + b))
Timing the last line:
%timeit Q.dot(np.cos(W.dot(x) + b))
10000 loops, best of 3: 27 µs per loop
Numpy's speed here does not surprise me particularly. What I am curious about is why my cython version is almost twice as slow. Assuming variables are similarly initialised outside of the function, I perform the calculation in cython as follows (edit: full working example at the end of this post - removed some of the code here for tidiness):
gsl_blas_dgemv(CblasNoTrans, 1, W, x, 1, b)
for 0 <= i < n:
    b.data[i*step] = cos(b.data[i*step])
gsl_blas_ddot(Q, b, &result)
Timing that part of the function in isolation produces:
10000 loops, best of 3: 52.8 µs per loop
Cython should be linked with Atlas (I am doing my testing in a notebook with -L/usr/lib64/atlas -ltatlas as flags after the cell magic %%cython), and has an -O3 optimisation flag.
My questions are as follows:
Is my problem algorithmic? To the best of my knowledge the loop over the dot product between W and x to compute the cosine of the elements is necessary.
Should row or column major ordering of the arrays make a significant difference in this case?
Is numpy that much faster simply due to a more efficient back-end for these kind of operations, and if so, how does it accomplish this?
Am I naive to think this is a good measure of performance?
Is my Cython version simply poorly coded?
I'm very interested in learning where the speed difference comes from in this case.
Edit 1: Complete example of cython code
%%cython -lgsl -L/usr/lib64/atlas -ltatlas -lm
import Cython.Compiler.Options as CO
CO.extra_compile_args = ["-f", "-O3", "-ffast-math", "-march=native"]

from libc.string cimport memcpy
from libc.math cimport cos
from cython_gsl cimport *
cimport cython

cdef gsl_vector * gsl_vector_create(double[:] v, int n):
    cdef gsl_vector *u = gsl_vector_alloc(n)
    u.data = &v[0]
    return u

cdef gsl_matrix * gsl_matrix_create(double[:,:] A, int n, int m):
    cdef gsl_matrix * B = gsl_matrix_alloc(n, m)
    B.data = &A[0,0]
    return B

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
def test(double[:] x_, double[:] Q_, double[:,:] W_, double[:] b_, int d, int n):
    cdef:
        gsl_vector * x = gsl_vector_create(x_, d)
        gsl_vector * Q = gsl_vector_create(Q_, n)
        gsl_matrix * W = gsl_matrix_create(W_, n, d)
        gsl_vector * b = gsl_vector_create(b_, n)
        int i
    gsl_blas_dgemv(CblasNoTrans, 1, W, x, 1, b)
    cdef size_t step = b.stride
    for 0 <= i < n:
        b.data[i*step] = cos(b.data[i*step])
    cdef double result
    gsl_blas_ddot(Q, b, &result)
    return result
Then in python:
x = np.array([1.,2.])
Q = np.random.random((1000))
W = np.random.random((1000,2))
b = np.random.random((1000))
%timeit Q.dot(np.cos(W.dot(x) + b))
%timeit test(x, Q, W, b, 2, 1000)
Edit 2
For the sake of completeness I have also tried compiling my cython function against GSL's Cblas. This is exactly the same as the above only with the cell magic flags
%%cython -lgsl -lgslcblas -lm
This is almost twice as slow as when compiled against atlas: 10000 loops, best of 3: 109 µs per loop.
I need to use a C library that gives me a function that takes as input a callback function. This callback function in turn takes an array and returns a value. So for example
double candidate(double x[]);
would be a valid callback.
I want to use Cython to implement a callback function, using Numpy to simplify the implementation.
So I am trying to implement a function
cdef double cythonCandidate(double *x):
and now I would like to "cast" x as a numpy array immediately and then do operations using numpy.
For example, I might want to write something like:
cdef double euclideanNorm(double *x):
    # cast x into a numpy array nx here - don't know how!!
    return np.sum(nx * nx)
Q1. How do I do this? How do I cast a C array into a numpy array without explicit copying, but just referencing the underlying buffer?
Q2: Is there python overhead in using numpy like I intend to?
For Q1:
%%cython -f
import numpy as np

def test_cast():
    cdef double *x = [1, 2, 3, 4, 5]
    cdef double[::1] x_view = <double[:5]>x  # cast to a memoryview that refers to the underlying buffer without copying
    xarr = np.asarray(x_view)  # numpy array referring to the same underlying buffer, no copy
    x_view[0] = 100
    xarr[1] = 200
    x[2] = 300
    print(xarr.flags)  # OWNDATA flag should be False
    return x[0], x[1], x[2], x[3], x[4]  # (100.0, 200.0, 300.0, 4.0, 5.0)
Note: if you don't declare x_view and instead write xarr = np.asarray(<double[:5]>x) directly, the Cython compiler may crash with the error message AttributeError: 'CythonScope' object has no attribute 'viewscope'. This can be fixed with from cython cimport view, for example:
%%cython -f
from cython cimport view  # comment this line to see what will happen
import numpy as np

def test_error_cast():
    cdef double *x = [1, 2, 3, 4, 5]
    xarr = np.asarray(<double[:5]>x)
    xarr[0] = 200
    return x[0], x[1], x[2], x[3], x[4]
I don't know whether it's a feature or bug.
For Q2:
The NumPy overhead should be significant when the array is small. See the benchmark below.
%%cython -a
from cython cimport view
import numpy as np

cdef inline double euclideanNorm(double *x, size_t x_size):
    xarr = np.asarray(<double[:x_size]>x)
    return np.sum(xarr*xarr)

cdef inline double euclideanNorm_c(double *x, size_t x_size):
    cdef double ss = 0.0
    cdef size_t i
    for i in range(x_size):
        ss += x[i] * x[i]
    return ss

def c_norm(double[::1] x):
    return euclideanNorm_c(&x[0], x.shape[0])

def np_norm(double[::1] x):
    return euclideanNorm(&x[0], x.shape[0])
Small array on my PC:
import numpy as np
small_arr = np.random.rand(100)
print(c_norm(small_arr))
print(np_norm(small_arr))
%timeit c_norm(small_arr) # 1000000 loops, best of 3: 864 ns per loop
%timeit np_norm(small_arr) # 100000 loops, best of 3: 8.51 µs per loop
Big array on my PC:
big_arr = np.random.rand(1000000)
print(c_norm(big_arr))
print(np_norm(big_arr))
%timeit c_norm(big_arr) # 1000 loops, best of 3: 1.46 ms per loop
%timeit np_norm(big_arr) # 100 loops, best of 3: 4.93 ms per loop
I am pressed for time to optimize a large piece of C code for speed, and I am looking for an algorithm (at best a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r number of rows, c number of columns) into a target matrix v[s][d] (s = c number of rows, d = r number of columns) in a "cache-friendly", i.e. data-locality-respecting, way. The typical size of u is around 5000 ... 15000 rows by 50 to 500 columns, and it is clear that row-wise access of elements is very cache-inefficient.
There are many discussions on this topic on the web (nearby this thread), but as far as I can see all of them discuss special cases such as square matrices, u[r][r], or a one-dimensional array, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background see here).
I would be very thankful for any hint that helps spare me the "reinvention of the wheel".
Martin
I do not think that an array of arrays is much harder to transpose than a linear array in general. But if you are going to have only 50 columns in each array, that sounds bad: it may not be enough to hide the overhead of pointer dereferencing.
I think that the overall strategy of a cache-friendly implementation is the same: process your matrix in tiles, and choose the tile size which performs best according to experiments.
template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}
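For reference, a rough Python sketch of the same tiling idea over an "array of arrays" (a list of rows); it only illustrates the loop structure that ProcessFullBlock/ProcessPartialBlock would implement, not the C-level performance:

def transpose_blocked(src, block=128):
    # src is a list of r rows, each with c elements; dst holds c rows of r elements
    r, c = len(src), len(src[0])
    dst = [[0.0] * r for _ in range(c)]
    for i0 in range(0, r, block):
        for j0 in range(0, c, block):
            # copy one tile; min() handles the partial tiles at the edges
            for i in range(i0, min(i0 + block, r)):
                row = src[i]
                for j in range(j0, min(j0 + block, c)):
                    dst[j][i] = row[j]
    return dst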
I have tried to optimize the best case when r = 10000, c = 500 (with float type). On my local machine, 128 x 128 tiles give a speedup of about 2.5x. Also, I have tried to use SSE to accelerate transposition, but it does not change the timings significantly. I think that's because the problem is memory bound.
Here are full timings (for 100 launches each) of various implementations on Core2 E4700 2.6GHz:
Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec
Here is the full code used.
So, I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a 1-dimensional array the compiler can output code that results in a prefetch operation and (in the case of a very new compiler) produce SIMD/vectorized code. With an array of pointers there is a dereference operation on each step, making prefetching more difficult. Not to mention there aren't any guarantees on memory alignment.
If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like OpenBLAS. It has been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and the vector instruction set).
The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays and make your code nice to read by using a #define to access elements of the array.
I have a piece of code here and I can't seem to figure out an efficient way of converting it to the Fortran 95 equivalent. I have tried several things already, but I'm always stuck on making 1D arrays from matrices and the other way around (the point is to reduce calculation time, and if I convert them, I can't think of another way than using loops again :/).
This is the piece of code:
do i=1,dim
  do j=1,dim
    Snorm(i,j)=Sval(j)/Sval(i)
    Bnorm(i,j)=Bval(j)/Bval(i)
    Pnorm(i,j)=Pval(j)/Pval(i)
  enddo
enddo
How would you write that in Fortran95 code?
The equivalent of the matrix calculations in R is this:
Snorm <- t(Sval %*% t(1/Sval))
Bnorm <- t(Bval %*% t(1/Bval))
Pnorm <- t(Pval %*% t(1/Pval))
The equivalent of it in Python is this:
Snorm = (numpy.dot((Svalmat.T),(1/Svalmat))).T
Bnorm = (numpy.dot((Bvalmat.T),(1/Bvalmat))).T
Pnorm = (numpy.dot((Pvalmat.T),(1/Pvalmat))).T
with Svalmat etc. the equivalent of Sval, but as a column matrix.
Does anyone have an idea?
It is not worth changing, in my opinion. It is valid Fortran 95, especially if your goal is calculation time. Any "clever" tricks with subarrays can introduce array temporaries.
The obvious try is forall or do concurrent
forall(i=1:dim, j=1:dim)
  Snorm(i,j) = Sval(j)/Sval(i)
  Bnorm(i,j) = Bval(j)/Bval(i)
  Pnorm(i,j) = Pval(j)/Pval(i)
end forall
and the same with do concurrent.
Notice that your original order of the loops is probably not efficient: Fortran arrays are stored column-major, so the inner loop should run over the first index i (i.e., swap the i and j loops) so that Snorm(i,j) and the others are written with stride one.