I'm trying to call np.random.choice, without replacement, row-by-row in a 2-D numpy array. I'm using Cython to get a speed boost, but the code only runs about a factor of 3 faster than a pure-Python implementation, which is not a great result. The bottleneck is the numpy function call itself. When I comment it out and just supply a static result of, say, [3, 2, 1, 0] to each row, I get a factor-of-1000 speedup (of course, then it's not doing much of anything :)
My question: is there something I'm doing wrong in calling the numpy function that's causing it to go super slow? In theory it's C talking to C, so it should be fast. I looked at the compiled code, and the call to the numpy function looks complex, with statements like __Pyx_GOTREF and __Pyx_PyObject_GetAttrStr that lead me to believe it's going through pure Python in the process (bad!!).
My code:
# tag: numpy
import numpy as np
# compile-time info for numpy
cimport numpy as np
np.import_array()
# array dtypes (np.float/np.int are removed aliases; use explicit widths)
W_DTYPE = np.float64
C_DTYPE = np.int64
cdef int NUM_SELECTIONS = 4  # FIXME should be function kwarg
# compile-time dtypes
ctypedef np.float64_t W_DTYPE_t
ctypedef np.int64_t C_DTYPE_t
def allocate_choices(np.ndarray[W_DTYPE_t, ndim=2] round_weights,
                     np.ndarray[C_DTYPE_t, ndim=1] choice_labels):
    """
    For ea. row in `round_weights` select NUM_SELECTIONS=4 items among
    corresponding `choice_labels`, without replacement, with corresponding
    probabilities in `round_weights`.
    Args:
        round_weights (np.ndarray): 2-d array of weights, w/
            size [n_rounds, n_choices]
        choice_labels (np.ndarray): 1-d array of choice labels,
            w/ size [n_choices]; choices must be *INTEGERS*
    Returns:
        choices (np.ndarray): selected items per round, w/ size
            [n_rounds, NUM_SELECTIONS]
    """
    assert round_weights.dtype == W_DTYPE
    assert choice_labels.dtype == C_DTYPE
    assert round_weights.shape[1] == choice_labels.shape[0]
    # initialize final choices array
    cdef int n_rows = round_weights.shape[0]
    cdef np.ndarray[C_DTYPE_t, ndim=2] choices = np.zeros([n_rows, NUM_SELECTIONS],
                                                          dtype=C_DTYPE)
    # Allocate choices, per round
    cdef int i, j
    cdef bint replace = False
    for i in range(n_rows):
        choices[i] = np.random.choice(choice_labels,
                                      NUM_SELECTIONS,
                                      replace,
                                      round_weights[i])
    return choices
Update on this, after chatting with some folks and examining the compiled code: @DavidW's comment above put it well:
" In theory it's C talking to C, so it should be fast" - no. Not true.
The main bit of Numpy that cimport numpy gives direct access to is
just faster indexing of arrays. Numpy functions are called using the
normal Python mechanism. They may ultimately be implemented in C, but
that doesn't give a shortcut from Cython's point-of-view.
So the issue here is that calling this NumPy function requires translating the inputs back into Python objects, passing them in, and then letting NumPy do its thing. I don't think this is the case for all NumPy functions (from timing experiments, some of the ones I call work quite fast), but plenty are not "Cythonized".
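For anyone who lands here with the same problem, a hedged workaround (my own sketch, not something from the discussion above): instead of calling np.random.choice once per row, the whole 2-D problem can be done in a single vectorized NumPy pass using the Gumbel top-k trick, which samples without replacement proportionally to the weights. The function name and the strictly-positive-weights assumption are mine:

import numpy as np

def allocate_choices_vectorized(round_weights, choice_labels, num_selections=4):
    # Gumbel top-k: adding independent Gumbel noise to the log-weights and
    # taking the k largest keys per row is equivalent to sampling k items
    # without replacement with probabilities proportional to the weights.
    # Assumes all weights are strictly positive.
    keys = np.log(round_weights) + np.random.gumbel(size=round_weights.shape)
    # indices of the k largest keys in each row (unordered within the k)
    top_k = np.argpartition(-keys, num_selections - 1, axis=1)[:, :num_selections]
    return choice_labels[top_k]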
Related: http://diffeq.sciml.ai/latest/tutorials/ode_example.html
I am trying to use the ODE solver in Julia (DifferentialEquations.jl) to solve a system of n interacting particles. Let's say the system is in 2D, and the equation of motion of each particle is a second-order ODE for its position with respect to time. Then four variables are needed for each particle, two for position and two for velocity, so 4n variables need to be declared in total. Is there a way to generalize, so that one does not need to list all 4n equations one by one?
For example:
http://diffeq.sciml.ai/latest/tutorials/ode_example.html#Example-2:-Solving-Systems-of-Equations-1
I try to modify the Lorenz equation in the link above a little bit to n particles (which is a very, very rough attempt, since I actually have no idea how to do it) by trying to extend u and du to 2-D arrays.
using DifferentialEquations
using Plots

n = 4
function lorenz(du,u,p,t,i)
    du[i,1] = 10.0*(u[i,2]-u[i,1])*sum(u[1:n,1])
    du[i,2] = (u[i,1]*(28.0-u[i,3]) - u[i,2])*sum(u[1:n,1])
    du[i,3] = (u[i,1]*u[i,2] - (8/3)*u[i,3])*sum(u[1:n,1])
end
u0 = hcat([1.0;0.0;0.0], [0.0;1.0;0.0], [0.0;0.0;1.0])
tspan = (0.0,100.0)
prob = ODEProblem(lorenz,u0,tspan)
sol = solve(prob)
This, without surprise, does not work, but I hope you get the idea of what I am trying to do. Is there any way for the ODE solver to solve u as a 2-D array (or another way that achieves a similar purpose)?
The problem is not the 2-D array. For example, replacing your lorenz definition with
function lorenz(du,u,p,t)
    du[1,1] = 10.0*(u[1,2]-u[1,1])
    du[1,2] = (u[1,1]*(28.0-u[1,3]) - u[1,2])
    du[1,3] = (u[1,1]*u[1,2] - (8/3)*u[1,3])
end
will work.
The problem is the function signature: the additional i is not supported. If you want to solve a network of Lorenz oscillators, you have to code it with a function of the same signature, e.g. lorenz_network!(du, u, p, t) for the in-place version. Put a loop over the individual oscillators inside your function and you are almost there; a sketch follows below.
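A minimal sketch of what that could look like, carrying over the sum coupling term from the question (the rand(n, 3) initial condition is just a placeholder):

using DifferentialEquations

n = 4
function lorenz_network!(du, u, p, t)
    s = sum(u[:, 1])  # global coupling term carried over from the question
    for i in 1:n
        du[i, 1] = 10.0 * (u[i, 2] - u[i, 1]) * s
        du[i, 2] = (u[i, 1] * (28.0 - u[i, 3]) - u[i, 2]) * s
        du[i, 3] = (u[i, 1] * u[i, 2] - (8 / 3) * u[i, 3]) * s
    end
end

u0 = rand(n, 3)  # placeholder initial condition, one row per oscillator
tspan = (0.0, 100.0)
prob = ODEProblem(lorenz_network!, u0, tspan)
sol = solve(prob)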
Here is a snippet of code I have to convert a numpy array to a c_float ctypes array so I can pass it to some C functions:
from ctypes import c_float

arr = my_numpy_array
arr = arr/255.
arr = arr.flatten()
new_arr = (c_float*len(arr))()
new_arr[:] = arr
but since the last line is effectively a for loop, and we all know how notoriously slow Python for loops are, it takes about 0.2 seconds for a medium-sized image array!! So this one line is right now the bottleneck of my whole pipeline. I want to know if there is any faster way of doing it.
Update
Please note "to pass to a function in C" in the question. To be more specific, I want to put a numpy array into the IMAGE data structure and pass it to the rgbgr_image function. You can find both here
The OP's answer makes 4 copies of my_numpy_array, at least 3 of which should be unnecessary. Here's a version that avoids them:
import numpy as np

# random array for demonstration
my_numpy_array = np.random.randint(0, 255, (10, 10))
# copy my_numpy_array to a float32 array
arr = my_numpy_array.astype(np.float32)
# divide in place
arr /= 255
# reshape should return a view, not a copy, unlike flatten
ctypes_arr = np.ctypeslib.as_ctypes(arr.reshape(-1))
In some circumstances, reshape will return a copy, but since arr is guaranteed to own its own data, it should return a view here.
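If in doubt, np.shares_memory gives a quick way to check whether you got a view or a copy; a small sketch:

import numpy as np

arr = np.random.randint(0, 255, (10, 10)).astype(np.float32)
# reshape on an array that owns its data returns a view...
print(np.shares_memory(arr, arr.reshape(-1)))  # True
# ...while flatten always returns a copy
print(np.shares_memory(arr, arr.flatten()))    # False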
So I managed to do it in this weird way using numpy:
arr = my_numpy_array
arr = arr/255.
arr = arr.flatten()
arr_float32 = np.copy(arr).astype(np.float32)
new_arr = np.ctypeslib.as_ctypes(arr_float32)
In my case it works 10 times faster.
[Edit]: I don't know why it doesn't work without np.copy or with reshape(-1), so it would be awesome if someone could explain.
I'm a Python/PyTorch user. First, in numpy, let's say I have an array M of size LxL, and I want to obtain the
array A = (M, ..., M) of size, say, NxLxL. Is there a more elegant/memory-efficient way of doing it than:
A = np.array([M]*N) ?
Same question with a torch tensor! Because right now, if M is a Variable(torch.tensor), I have to do:
A = torch.autograd.Variable(torch.tensor(np.array([M]*N)))
which is ugly!
Note that you need to decide whether you would like to allocate new memory for your expanded array or whether you simply require a new view of the existing memory of the original array.
In PyTorch, this distinction gives rise to the two methods expand() and repeat(). The former only creates a new view on the existing tensor where a dimension of size one is expanded to a larger size by setting the stride to 0. Any dimension of size 1 can be expanded to an arbitrary value without allocating new memory. In contrast, the latter copies the original data and allocates new memory.
In PyTorch, you can use expand() and repeat() as follows for your purposes:
import torch
L = 10
N = 20
A = torch.randn(L,L)
A.expand(N, L, L) # specifies new size
A.repeat(N,1,1) # specifies number of copies
In Numpy, there are a multitude of ways to achieve what you did above in a more elegant and efficient manner. For your particular purpose, I would recommend np.tile() over np.repeat(), since np.repeat() is designed to operate on the particular elements of an array, while np.tile() is designed to operate on the entire array. Hence,
import numpy as np
L = 10
N = 20
A = np.random.rand(L,L)
np.tile(A,(N, 1, 1))
If you don't mind creating new memory:
In numpy, you can use np.repeat() or np.tile(). With efficiency in mind, you should choose the one which organises the memory for your purposes, rather than re-arranging after the fact:
np.repeat([1, 2], 2)   # -> [1, 1, 2, 2]
np.tile([1, 2], 2)     # -> [1, 2, 1, 2]
In pytorch, you can use tensor.repeat(). Note: This matches np.tile, not np.repeat.
If you don't want to create new memory:
In numpy, you can use np.broadcast_to(). This creates a readonly view of the memory.
In pytorch, you can use tensor.expand(). This creates an editable view of the memory, so operations like += will have weird effects.
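A small sketch illustrating the view-vs-copy distinction described above (the shapes are arbitrary):

import numpy as np
import torch

M = np.arange(4).reshape(2, 2)
view = np.broadcast_to(M, (3, 2, 2))   # no new memory is allocated
print(np.shares_memory(M, view))       # True
print(view.flags.writeable)            # False: the view is read-only

t = torch.arange(4.0).reshape(2, 2)
e = t.expand(3, 2, 2)                  # view: stride 0 along the new dim
print(e.data_ptr() == t.data_ptr())    # True: same underlying storage
r = t.repeat(3, 1, 1)                  # genuine copy with its own memory
print(r.data_ptr() == t.data_ptr())    # False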
In numpy, repeat is faster:
np.repeat(M[None, ...], N, 0)
I expand the dimensions of M, and then repeat along that new dimension.
Here is the Cython code I am trying to optimize:
import cython
cimport cython
from libc.stdlib cimport rand, srand, RAND_MAX
import numpy as np
cimport numpy as np

def genLoans(int loanid):
    cdef int i, j, k
    cdef double[:,:,:] loans = np.zeros((240, 20, 1000))
    cdef double[:,:] aggloan = np.zeros((240, 20))
    for j in range(1000):
        srand(loanid*1000+j)
        for i in range(240):
            for k in range(20):
                loans[i,k,j] = rand()
                ### some other logic
                aggloan[i,k] += loans[i,k,j]/1000
    return aggloan
cython -a shows the numpy allocation lines highlighted in yellow [annotation screenshot not included].
I guess that initializing the zero arrays loans and aggloan with numpy is what slows me down, yet I need to run 5000+ loans. I'm just wondering if there are other ways to avoid using numpy when I define the 3-D/2-D arrays and return them...
The yellow part is because of the Numpy call, where you allocate the array. What you can do is pass these arrays as arguments to the function, and reuse them from one call to the next.
Also, I see you are rewriting all the elements, so you are claiming memory, filling it with zeroes, and then putting in your numbers. If you are sure you will overwrite all the elements, you can use np.empty, which will not initialize the variables.
Note: the Linux kernel has a specific way of allocating memory initialised to 0 that is faster than for any other value, and modern Numpy can use it, but it is still slower than empty:
In [4]: %timeit np.zeros((100,100))
100000 loops, best of 3: 4.04 µs per loop
In [5]: %timeit np.ones((100,100))
100000 loops, best of 3: 8.99 µs per loop
In [6]: %timeit np.empty((100,100))
1000000 loops, best of 3: 917 ns per loop
Last but not least, are you sure this is your bottleneck? I don't know what processing you are doing, but yellow is the number of lines of C code, not time. Anyway, from the timings, using empty should speed that up by a factor of four. If you want more, post the rest of your code at CR.
Edit:
Expanding on my second sentence: your function signature can be
def genLoans(int loanid, double[:,:,:] loans, double[:,:] aggloan):
You initialize the arrays before your loop and just pass them in again and again; a sketch of the caller side is below.
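A minimal sketch of that caller pattern (the 5000-loan loop and the aggloan reset are my assumptions about how it would be used):

import numpy as np

loans = np.zeros((240, 20, 1000))    # scratch space, fully overwritten each call
aggloan = np.zeros((240, 20))        # accumulator

results = []
for loanid in range(5000):           # 5000+ loans, as in the question
    aggloan[:] = 0.0                 # reset, since genLoans accumulates with +=
    genLoans(loanid, loans, aggloan)
    results.append(aggloan.copy())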
In any case, on my machine (Linux, Intel i5) it takes 9 µs, so you are spending a total of 45 ms. This is definitely not your bottleneck. Profile!
Say I have an array of integers A such that A[i] = j, and I want to "invert it"; that is, to create another array of integers B such that B[j] = i.
This is trivial to do procedurally in linear time in any language; here's a Python example:
def invert_procedurally(A):
    B = [None] * (max(A) + 1)
    for i, j in enumerate(A):
        B[j] = i
    return B
However, is there any way to do this functionally (as in functional programming, using map, reduce, or functions like those) in linear time?
The code might look something like this:
def invert_functionally(A):
    # We can't modify variables in FP; we can only return a value
    return map(???, A)  # What goes here?
If this is not possible, what is the best (most efficient) alternative when doing functional programming?
In this context are arrays mutable or immutable? Generally I'd expect the mutable case to be about as straightforward as your Python implementation, perhaps aside from a few wrinkles with types. I'll assume you're more interested in the immutable scenario.
This operation inverts the indices and elements, so it's also important to know something about what constitutes valid array indices and impose those same constraints on the elements. Haskell has a class for index constraints called Ix. Any Ix type is ordered and has a range implementation to make an ordered list of indices ranging from one specified index to another. I think this Haskell implementation does what you want.
import Data.Array.IArray

invertArray :: (Ix x) => Array x x -> Array x x
invertArray arr = array (low, high) (zip oldElems oldIndices)
  where oldElems   = elems arr
        oldIndices = indices arr
        low        = minimum oldElems
        high       = maximum oldElems
Under the hood, array associates each listed (index, element) pair with its slot in the specified range. That part ought to be linear time, and so is the one-time operation of extracting the elements and indices from an array.
Whenever the sets of the input array's indices and elements differ, some elements will be undefined, which for better or worse blows up faster than Python's None. I believe you could overcome the undefined issue by implementing a new Ix a instance over the Maybe monad, for instance.
Quick side-note: check out the invPerm example in the Haskell 98 Library Report. It does something similar to invertArray, but assumes up front that input array's elements are a permutation of its indices.
A solution needing map and three operations:
toTuples, which views the array as a list of tuples (i, e), where i is the index and e the element at that index;
fromTuples, which creates and loads an array from a list of tuples;
swap, which takes a tuple (a, b) and returns (b, a).
Hence the solution would be (in Haskellish notation):
invert = fromTuples . map swap . toTuples
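A sketch of what those three helpers could look like with Data.Array, assuming (as in invertArray above) that the elements are themselves valid indices:

import Data.Array
import Data.Tuple (swap)

toTuples :: Ix i => Array i i -> [(i, i)]
toTuples = assocs  -- (index, element) pairs

fromTuples :: Ix i => [(i, i)] -> Array i i
fromTuples ts = array (minimum is, maximum is) ts
  where is = map fst ts  -- new bounds come from the new indices

invert :: Ix i => Array i i -> Array i i
invert = fromTuples . map swap . toTuples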