I have code that does a lot of basic arithmetic on numerical data held in multiple arrays. I have noticed that in most conceivable operations, numpy classes always seem to be slower than the default Python ones. Why is this?
For example, I have a simple snippet where all I do is update one numpy array element with a value retrieved from another numpy array, or with the product of two other numpy array elements. It should be a basic operation, yet it is always at least 2-3x slower than doing the same thing with a list.
At first I thought this was because I hadn't harmonized the data structures, so the interpreter had to do a lot of unnecessary conversions. So I recoded the whole thing, replacing every float with numpy.float64 and every list with numpy.ndarray. The data is now numpy.float64 throughout the code, so no unnecessary conversions should be needed.
The code is still 2-3 times slower than if I just use list and float.
For example:
ALPHA = [[random.uniform(*a_param) for k in range(l2)] for l in range(l1)]
COEFF = [[random.uniform(*c_param) for k in range(l2)] for l in range(l1)]
summa=0.0
for l in range(l1):
    for k in range(l2):
        summa += COEFF[l][k] * ALPHA[l][k]
will always be 2-3x faster than:
ALPHA = numpy.random.uniform(*a_param, (l1,l2))
COEFF = numpy.random.uniform(*c_param, (l1,l2))
summa=0.0
for l in range(l1):
    for k in range(l2):
        summa += COEFF[l][k] * ALPHA[l][k]
How is this possible? Am I doing something wrong? After all, numpy is supposed to speed things up.
For the record, I am using Python 3.5.3 and numpy 1.12.1. Should I update?
Modifying a single element of a NumPy array is not expected to be faster than modifying a single element of a Python list. The speedup from using NumPy comes when you perform "vectorized" operations on entire arrays (or subsets of arrays). Try assigning the first 10000 elements of a NumPy array to be equal to the first 10000 elements of another, and compare that with using lists.
If your data and/or operations are very small (one or just a few elements), you are probably better off not using NumPy.
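The comparison suggested above can be sketched as follows. This is only an illustration: the array size and the function names are assumptions made up for the example, and the timings it prints will vary with the machine and NumPy version.

```python
import timeit
import numpy as np

n = 10_000                                   # number of elements to copy
src = np.arange(100_000, dtype=np.float64)   # illustrative source data
dst = np.zeros_like(src)
src_list = src.tolist()
dst_list = [0.0] * len(src_list)

def copy_elementwise():
    # One Python-level indexing operation per element of the NumPy array
    for i in range(n):
        dst[i] = src[i]

def copy_vectorized():
    # A single slice assignment; the loop runs inside NumPy's compiled code
    dst[:n] = src[:n]

def copy_list_elementwise():
    # The same element-by-element loop, but on plain Python lists
    for i in range(n):
        dst_list[i] = src_list[i]

for func in (copy_elementwise, copy_vectorized, copy_list_elementwise):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```

On a typical setup the element-wise NumPy loop is expected to be the slowest of the three and the slice assignment the fastest, which is the pattern described above.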
I tried two things:
Running your two blocks of code. For me, they were about the same speed.
Writing a new function that exploits numpy's vectorized math. This is several times faster than the other methods.
Here are my functions:
import random
import numpy as np

def with_lists(l1, l2):
    ALPHA = [[random.uniform(0, 1) for k in range(l2)] for l in range(l1)]
    COEFF = [[random.uniform(0, 1) for k in range(l2)] for l in range(l1)]
    summa = 0.0
    for l in range(l1):
        for k in range(l2):
            summa += COEFF[l][k] * ALPHA[l][k]
    return summa

def with_arrays(l1, l2):
    ALPHA = np.random.uniform(size=(l1, l2))
    COEFF = np.random.uniform(size=(l1, l2))
    summa = 0.0
    for l in range(l1):
        for k in range(l2):
            summa += COEFF[l][k] * ALPHA[l][k]
    return summa

def with_ufunc(l1, l2):
    """Avoid the loop completely by exploiting numpy's
    elementwise math."""
    ALPHA = np.random.uniform(size=(l1, l2))
    COEFF = np.random.uniform(size=(l1, l2))
    return np.sum(COEFF * ALPHA)
When I compare the speed (I'm using the %timeit magic in IPython), I get the following:
>>> %timeit with_lists(10, 10)
107 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit with_arrays(10, 10)
91.9 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit with_ufunc(10, 10)
12.6 µs ± 589 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The third function, without loops, is about 10 to 30 times faster on my machine, depending on the values of l1 and l2.
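Note that the functions above also time the random data generation. A minimal sketch that creates the data once and times only the sum of products might look like this; the seed, the sizes, and the names loop_sum and vectorized_sum are assumptions for illustration, not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
l1, l2 = 10, 10
ALPHA = rng.uniform(size=(l1, l2))
COEFF = rng.uniform(size=(l1, l2))

def loop_sum(coeff, alpha):
    # Python-level double loop; coeff[l, k] avoids the temporary row
    # that coeff[l][k] would create on every access
    total = 0.0
    for l in range(coeff.shape[0]):
        for k in range(coeff.shape[1]):
            total += coeff[l, k] * alpha[l, k]
    return total

def vectorized_sum(coeff, alpha):
    # Elementwise product and reduction both run in compiled code
    return np.sum(coeff * alpha)

# Both approaches should agree up to floating-point rounding
print(np.isclose(loop_sum(COEFF, ALPHA), vectorized_sum(COEFF, ALPHA)))
```

In IPython, `%timeit loop_sum(COEFF, ALPHA)` and `%timeit vectorized_sum(COEFF, ALPHA)` would then compare the two computations on identical inputs.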
The following function applies numpy functions to two numpy arrays.
import numpy as np
def my_func(a: np.ndarray, b: np.ndarray) -> float:
    return np.nanmin(a, axis=0) + np.nanmin(b, axis=0)
>>> my_func(np.array([1., 2., np.nan]), np.array([1., np.nan]))
2.0
However, what is the best way to apply this same function to an np.array of np.arrays of different shapes?
a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object)  # First array has shape (2,), second (4,)
b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)
np.vectorize does work
>>> np.vectorize(my_func)(a, b)
array([2. , 2.5])
but as specified by the vectorize documentation:
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Is there a more clever solution?
I could use np.pad to get identical shapes, but that seems sub-optimal because it requires padding up to the maximum length of the inner arrays (here 4 for a and 3 for b).
I looked at numba and this stack exchange about performance, but I am not sure what the best practice is for such a case.
Thanks !
Your function and arrays:
In [222]: def my_func(a: np.ndarray, b: np.ndarray) -> float:
     ...:     return np.nanmin(a, axis=0) + np.nanmin(b, axis=0)
     ...:
In [223]: a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object
     ...: ) # First array shape (2,), second (4,)
     ...: b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)
In [224]: a
Out[224]: array([array([1., 2.]), array([ 1., 2., 3., nan])], dtype=object)
In [225]: b
Out[225]: array([array([1]), array([1.5, 2.5, nan])], dtype=object)
Compare vectorize with a straightforward list comprehension:
In [226]: np.vectorize(my_func)(a, b)
Out[226]: array([2. , 2.5])
In [227]: [my_func(i,j) for i,j in zip(a,b)]
Out[227]: [2.0, 2.5]
and their times:
In [228]: timeit np.vectorize(my_func)(a, b)
157 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [229]: timeit [my_func(i,j) for i,j in zip(a,b)]
85.9 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [230]: timeit np.array([my_func(i,j) for i,j in zip(a,b)])
89.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you are going to work with object arrays, frompyfunc is faster than vectorize:
In [231]: np.frompyfunc(my_func,2,1)(a, b)
Out[231]: array([2.0, 2.5], dtype=object)
In [232]: timeit np.frompyfunc(my_func,2,1)(a, b)
83.2 µs ± 50.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I'm a bit surprised that it's even better than the list comprehension.
frompyfunc (and vectorize) are more useful when the inputs need to 'broadcast' against each other:
In [233]: np.frompyfunc(my_func,2,1)(a[:,None], b)
Out[233]:
array([[2.0, 2.5],
       [2.0, 2.5]], dtype=object)
I'm not a numba expert, but I suspect it doesn't handle object dtype arrays, or if it does, it doesn't improve speed much. Remember, object dtype means the elements are object references, just like in lists.
I get better times by using otypes and taking the function creation out of the timing loop:
In [235]: %%timeit f=np.vectorize(my_func, otypes=[float])
...: f(a, b)
...:
...:
95.5 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [236]: %%timeit f=np.frompyfunc(my_func,2,1)
...: f(a, b)
...:
...:
81.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you don't know about otypes, you haven't read the np.vectorize docs well enough.
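A different approach, not one of the methods timed above, is to flatten each ragged array once and let a ufunc reduce over segments. The sketch below assumes every inner array is non-empty and uses np.fmin, which ignores NaN when the other operand is a number; the helper name segmented_nanmin is hypothetical.

```python
import numpy as np

def segmented_nanmin(ragged):
    """NaN-ignoring minimum of each inner array of a 1-D object array.

    Assumes every inner array has at least one element.
    """
    lengths = np.array([len(x) for x in ragged])
    # Start index of each inner array within the flattened data
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    flat = np.concatenate(list(ragged)).astype(float)
    # fmin.reduceat reduces flat[offsets[i]:offsets[i+1]] for each i
    return np.fmin.reduceat(flat, offsets)

a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object)
b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)

result = segmented_nanmin(a) + segmented_nanmin(b)
print(result)
```

The Python-level work is then limited to collecting lengths and concatenating, so this may pay off when there are many inner arrays; for two short arrays as here, the fixed overhead probably dominates.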
Basically I have two 1d numpy arrays, let's call them x and y, both of the same length. I want to essentially get the result x1*y1 + x2*y2 + ... + xn*yn. Obviously I could do this with a for loop, but is there a built-in method or something where I can do this in one line?
What you are trying to compute is known as an 'inner product' and, in the case of two vectors, is called a 'dot product'. Numpy has built-in functions for computing both which are optimized for speed over the simple (x*y).sum() solution.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
print(np.inner(a, b))
# 10
print(np.dot(a, b))
# 10
Some timing results in the table below with vectors a and b being 1000 randomly selected elements using np.random.randn:
np.dot(a, b) # 920 ns ± 9.9 ns
np.inner(a, b) # 1.1 µs ± 83.5 ns
(a*b).sum() # 4.2 µs ± 62.9 ns
np.sum(a*b) # 5.7 µs ± 170 ns
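You can use sum(x*y) or (x*y).sum(); for 1d arrays they give the same result, although the built-in sum iterates over the elements in Python and is slower on large arrays.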
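As a quick check that these spellings agree, here is a small sketch with made-up example vectors; the `@` operator is included as one more equivalent form for 1d arrays.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])

# Explicit loop, for reference
loop_result = 0.0
for xi, yi in zip(x, y):
    loop_result += xi * yi

results = {
    "loop": loop_result,
    "sum(x*y)": sum(x * y),        # Python's built-in sum over the product
    "(x*y).sum()": (x * y).sum(),  # NumPy reduction
    "np.dot": np.dot(x, y),
    "x @ y": x @ y,
}
for name, value in results.items():
    print(f"{name}: {value}")
```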
Say I have the following numpy array
n = 50
a = np.array(range(1, 1000)) / 1000.
I would like to execute this line of code
%timeit v = [a ** k for k in range(0, n)]
1000 loops, best of 3: 2.01 ms per loop
However, this line of code will ultimately be executed in a loop, therefore I have performance issues.
Is there a way to optimize the loop? For example, the result of a specific calculation i in the list comprehension is simply the result of the previous calculation result in the loop, multiplied by a again.
I don't mind storing the results in a 2d-array instead of arrays in a list. That would probably be cleaner. By the way, I also tried the following, but it yields similar performance results:
k = np.array(range(0, n))
ones = np.ones(n)
temp = np.outer(a, ones)
And then performed the following calculation
%timeit temp ** k
1000 loops, best of 3: 1.96 ms per loop
or
%timeit np.power(temp, k)
1000 loops, best of 3: 1.92 ms per loop
But both yield similar results to the list comprehension above. By the way, n will always be an integer in my case.
In quick tests cumprod seems to be faster.
In [225]: timeit v = np.array([a ** k for k in range(0, n)])
2.76 ms ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %%timeit
...: A=np.broadcast_to(a[:,None],(len(a),50))
...: v1=np.cumprod(A,axis=1)
...:
208 µs ± 42.3 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To compare values I have to tweak ranges, since v includes a 0 power, while v1 starts with a 1 power:
In [224]: np.allclose(np.array(v)[1:], v1.T[:-1])
Out[224]: True
But the timings suggest that cumprod is worth refining.
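One possible refinement, sketched here under the assumption that n is at least 1, is to put a row of ones first so that the cumulative product also yields the zeroth power and no range tweaking is needed. The function name powers_cumprod is made up for this example.

```python
import numpy as np

def powers_cumprod(a, n):
    """Return an (n, len(a)) array whose row k holds a**k, for k = 0..n-1."""
    stacked = np.empty((n, a.size), dtype=float)
    stacked[0] = 1.0      # a**0
    stacked[1:] = a       # every later row starts as a copy of a
    # Running product down the rows: row k becomes 1 * a * ... * a (k times)
    return np.cumprod(stacked, axis=0)

n = 50
a = np.array(range(1, 1000)) / 1000.
v2 = powers_cumprod(a, n)
print(np.allclose(v2, np.array([a ** k for k in range(n)])))
```

Repeated multiplication accumulates rounding differently from a direct power, so the comparison uses np.allclose rather than exact equality.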
The proposed duplicate was Efficient way to compute the Vandermonde matrix. That still has good ideas.
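For reference, NumPy's own np.vander builds this matrix directly. A minimal sketch, using the same a and n as in the question (whether it is faster than cumprod would need to be timed):

```python
import numpy as np

n = 50
a = np.array(range(1, 1000)) / 1000.

# increasing=True orders the columns as a**0, a**1, ..., a**(n-1);
# the transpose gives one row per power, like the list comprehension
v_vander = np.vander(a, N=n, increasing=True).T

print(np.allclose(v_vander, np.array([a ** k for k in range(n)])))
```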
So I have a list with 5,000,000 integers, and I want to convert the list to a numpy array. I tried the following code:
numpy.array( list )
But it is very slow.
I benchmarked this operation 100 times and compared it with looping over the list 100 times. There is not much difference.
Any good idea how to make it faster?
If you have cython you can create a function that is definitely faster. But just a warning: it will crash if there are invalid elements inside your list (non-integers or integers that are too big).
I use the IPython magic here (%load_ext cython and %%cython); the point is to show what the function looks like, not to show how you can compile Cython code (it's not hard, and Cython's "how-to-compile" documentation is quite good).
%load_ext cython
%%cython
cimport cython
import numpy as np

@cython.boundscheck(False)
cpdef to_array(list inp):
    cdef long[:] arr = np.zeros(len(inp), dtype=long)
    cdef Py_ssize_t idx
    for idx in range(len(inp)):
        arr[idx] = inp[idx]
    return np.asarray(arr)
And the timings:
import numpy as np
def other(your_list):  # the approach from @Damian Lattenero in the other answer
    ret = np.zeros(shape=(len(your_list)), dtype=int)
    np.copyto(ret, your_list)
    return ret
inp = list(range(1000000))
%timeit np.array(inp)
# 315 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array(inp, dtype=int)
# 311 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit other(inp)
# 316 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_array(inp)
# 23.4 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it's more than 10 times faster.
I think this is fast, I checked the times:
import numpy as np
import time
start_time = time.time()
number = 1
elements = 10000000
your_list = [number] * elements
ret = np.zeros(shape=(len(your_list)))
np.copyto(ret, your_list)
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.7615997791290283 seconds ---
Make a big list of small integers; use the numpy crutch:
In [619]: arr = np.random.randint(0,256, 5000000)
In [620]: alist = arr.tolist()
In [621]: timeit alist = arr.tolist() # just for reference
10 loops, best of 3: 108 ms per loop
And time for plain list iteration (doesn't do anything)
In [622]: timeit [i for i in alist]
10 loops, best of 3: 193 ms per loop
Make an array of specified dtype
In [623]: arr8 = np.array(alist, 'uint8')
In [624]: timeit arr8 = np.array(alist, 'uint8')
1 loop, best of 3: 508 ms per loop
We can get a 2x improvement with fromiter; evidently it does less checking. np.array will work even if the list is a mix of numbers and strings. It also handles lists of lists etc.
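np.fromiter also accepts a count argument that lets it allocate the output once instead of growing it as it reads. The sketch below uses a smaller made-up list than the one timed above; whether count gives a measurable gain on a given machine is something to time rather than assume.

```python
import numpy as np

# Illustrative data: small integers that fit in uint8
alist = list(range(256)) * 1000

# Without count, fromiter has to grow its buffer as it consumes the iterable
arr_plain = np.fromiter(alist, dtype=np.uint8)

# With count, the number of items is known up front
arr_counted = np.fromiter(alist, dtype=np.uint8, count=len(alist))

print(np.array_equal(arr_plain, arr_counted))
```

Unlike np.array, fromiter only builds 1-D arrays and requires an explicit dtype, which fits the observation that it does less checking.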
In [625]: timeit arr81 = np.fromiter(alist, 'uint8')
1 loop, best of 3: 249 ms per loop
The advantage of working with arrays becomes apparent when we do math across the whole thing:
In [628]: timeit arr8.sum()
100 loops, best of 3: 6.93 ms per loop
In [629]: timeit sum(alist)
10 loops, best of 3: 74.4 ms per loop
In [630]: timeit 2*arr8
100 loops, best of 3: 6.89 ms per loop
In [631]: timeit [2*i for i in alist]
1 loop, best of 3: 465 ms per loop
It's well known that working with arrays is faster than with lists, but that there is a significant 'startup' overhead.
Say I have a 500000x1 array called A. I want to divide this array into 1000 equal sections and then calculate the mean of each section. So I will end up with a 1000x1 array called B, in which B[1] is the mean of A[1:500], B[2] is the mean of A[501:1000], and so on. Since I will be doing this many, many times, I want to do it efficiently. What's the most effective way of doing this in Matlab/Python?
NumPy/Python
We could reshape to have 500 columns and then compute average along the second axis -
A.reshape(-1,500).mean(axis=1)
Sample run -
In [89]: A = np.arange(50)+1;
In [90]: A.reshape(-1,5).mean(1)
Out[90]: array([ 3., 8., 13., 18., 23., 28., 33., 38., 43., 48.])
Runtime test :
An alternative method to get those average values would be with the old-fashioned way of computing the sum and then dividing by the number of elements involved in the summation. Let's time these two methods -
In [107]: A = np.arange(500000)+1;
In [108]: %timeit A.reshape(-1,500).mean(1)
1000 loops, best of 3: 1.19 ms per loop
In [109]: %timeit A.reshape(-1,500).sum(1)/500.0
1000 loops, best of 3: 583 µs per loop
That looks like quite an improvement with the alternative method! But wait: with the mean method NumPy converts the integer input to float type by default, and that conversion overhead showed up here.
So, if we use float type input arrays, we would have a different and a fair scenario -
In [144]: A = np.arange(500000).astype(float)+1;
In [145]: %timeit A.reshape(-1,500).mean(1)
1000 loops, best of 3: 534 µs per loop
In [146]: %timeit A.reshape(-1,500).sum(1)/500.0
1000 loops, best of 3: 516 µs per loop
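The reshape approach can be wrapped in a small helper. This is a sketch under the assumption that the array length is an exact multiple of the number of sections; the name block_means and the error handling are additions for illustration.

```python
import numpy as np

def block_means(values, n_sections):
    """Mean of each of n_sections equal, contiguous sections of a 1-D array."""
    values = np.asarray(values, dtype=float)   # convert once, up front
    if values.size % n_sections != 0:
        raise ValueError("array length must be divisible by n_sections")
    # One row per section, then average along each row
    return values.reshape(n_sections, -1).mean(axis=1)

# Same data as the sample run above: 50 values in 10 sections of 5
B = block_means(np.arange(50) + 1, 10)
print(B)
```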
MATLAB
With column-major ordering, we would reshape to have 500 rows and then average along the first dimension -
mean(reshape(A,500,[]),1)
Sample run -
>> A = 1:50;
>> mean(reshape(A,5,[]),1)
ans =
3 8 13 18 23 28 33 38 43 48
Runtime test :
Let's try out the old-fashioned way here too -
>> A = 1:500000;
>> func1 = @() mean(reshape(A,500,[]),1);
>> timeit(func1)
ans =
0.0013021
>> func2 = @() sum(reshape(A,500,[]),1)/500.0;
>> timeit(func2)
ans =
0.0012291