A faster way to merge a list of 1D-arrays?

A faster way to merge a list of 1D-arrays? - arrays

I have a function distance that take a natural number as an input and return a 1-D array of length 199. My goal is to merge all the arrays distance(0), ..., distance(499). My code to do so is as follows:
import numpy as np
np.random.seed(42)
n = 200
d = 500
sample = np.random.uniform(size = [n, d])
def distance(i):
value = list(sample[i, 0:3])
temp = value - sample[(i + 1):n, 0:3]
return np.sqrt(np.sum(temp**2, axis = 1))
temp = [distance(i) for i in range(n - 1)]
result = [j for i in temp for j in i]
Because I work with large d, I want to optimize as good as possible. I would like to ask for a faster way to merge such arrays.
Thank you so much!

If you are just trying to compute the pairwise distance:
from scipy.spatial.distance import cdist
dist = cdist(sample[:,:3], sample[:,:3])
Of course you get back a symmetric array with all pairwise distances. To get your result, you can do:
result = dist[np.triu_indices(n,k=1)]
Regarding the broadcasting comment, cdist will do something similar to this:
dist = np.sum((sample[None,:,:3]-sample[:,None,:3])**2, axis=-1)**0.5
For reference, below is the run time for each:
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = [j for i in temp for j in i]
6.41 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = np.hstack(temp)
4.86 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = np.concatenate(temp)
4.28 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
dist = np.sum((sample[None,:,:3]-sample[:,None,:3])**2, axis=-1)**0.5
result = dist[np.triu_indices(n,k=1)]
1.47 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
dist = cdist(sample[:,:3], sample[:,:3])
result = dist[np.triu_indices(n,k=1)]
415 µs ± 26.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Related

Most computationally efficient method to get the rest of the array of a slice in numpy array?

For a numpy array
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
You can get a slice using something like a[3:6]
But what about getting the rest of the slice? What is the most computationally efficient method for this? So something like a[:3, 6:].
The best I can come up with is to use a concatenate.
np.concatenate([a[:3], a[6:]], axis=0)
I am wondering if this is the best method, as I will be doing millions of these operations for a data processing pipeline.

Your solution seems to be the most efficient one since it is more than 2x faster than the next best thing.
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
%timeit -n 100000 np.concatenate([a[:3], a[6:]], axis=0)
%timeit -n 100000 np.delete(a, slice(3, 6))
%timeit -n 100000 a[np.r_[:3,6:]]
>2.03 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>4.61 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>11 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
However, the real question is if these operations (complement set of slice/deletion) need to be applied consecutively. Otherwise, you could aggregate the indices via set operations and slice the compliment a single time in the end to obtain the proper NumPy array.

I find declaring an empty array and filling it up seems to be very slightly better than using concat . As André mentioned in their comment this will vary based on the shape.
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
def testing123():
new = np.zeros(6, dtype=int)
new[:3] = a[0:3]
new[3:] = a[6:]
return new
%timeit -n 100000 np.concatenate([a[:3], a[6:]], axis=0)
100000 loops, best of 5: 2.18 µs per loop
%timeit -n 100000 np.delete(a, slice(3, 6))
100000 loops, best of 5: 6.11 µs per loop
%timeit -n 100000 a[np.r_[:3,6:]]
100000 loops, best of 5: 16.4 µs per loop
%timeit -n 100000 testing123()
100000 loops, best of 5: 2.01 µs per loop
a = np.arange(10_000)
def testing123():
new = np.empty(5000, dtype=int)
new[:2500] = a[:2500]
new[2500:] = a[7500:]
return new
%timeit -n 100000 np.concatenate([a[:2500], a[7500:]], axis=0)
100000 loops, best of 5: 3.99 µs per loop
%timeit -n 100000 np.delete(a, slice(2500, 7500))
100000 loops, best of 5: 7.76 µs per loop
%timeit -n 100000 a[np.r_[:2500,7500:]]
100000 loops, best of 5: 47.3 µs per loop
%timeit -n 100000 testing123()
100000 loops, best of 5: 3.61 µs per loop

Get max and std over array-fields in dataframe column pandas

(Pandas version 1.1.1.)
I have arrays as entries in the cells of a Dataframe column.
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
> float array
> 0 1 [1, 8]
> 1 2 [5, 14]
Now I need some statistics over each array position.
It works perfectly with the mean:
df['array'].mean()
> array([ 3., 11.])
But if I try to do it with the maximum or the standard deviation error occur:
df['array'].std()
> setting an array element with a sequence.
df['array'].max()
> The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It seems like .mean() .std() ánd .max() are constructed differently. Anyhow, has someone an idea how to caluculate the std and max (and min etc), without dividing the array into several columns?
(The DataFrame has array's of different shapes. But I do only want to caluculate statistics within a .groupyby() over rows where the arrays do have the same shape.)

You can convert columns to 2d arrays and use numpy for count:
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
#2k for test
df = pd.concat([df] * 1000, ignore_index=True)
In [150]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
4.25 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
944 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [152]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
4.31 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [153]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
836 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For 20k rows:
df = pd.concat([df] * 10000, ignore_index=True)
In [155]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
35.3 ms ± 87.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
9.13 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [157]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
35.3 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
8.21 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How to find the index of the min/max object in a numpy array of objects?

In a numpy array of objects (where each object has a numeric attribute y that can be retrieved by the method get_y()), how do I obtain the index of the object with the maximum (or minimum) y attribute (without explicit looping; to save time)? If myarray were a python list, I could use the following, but ndarray does not seem to support index. Also, numpy argmin does not seem to allow a provision for supplying the key.
minindex = myarray.index(min(myarray, key = lambda x: x.get_y()))

Some timings, comparing a numeric dtype, object dtype, and lists. Draw your own conclusions:
In [117]: x = np.arange(1000)
In [118]: xo=x.astype(object)
In [119]: np.sum(x)
Out[119]: 499500
In [120]: np.sum(xo)
Out[120]: 499500
In [121]: timeit np.sum(x)
10.8 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [122]: timeit np.sum(xo)
39.2 µs ± 673 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [123]: sum(x)
Out[123]: 499500
In [124]: timeit sum(x)
214 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [125]: timeit sum(xo)
25.3 µs ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: timeit sum(x.tolist())
29.1 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [127]: timeit sum(xo.tolist())
14.4 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [129]: %%timeit temp=x.tolist()
...: sum(temp)
6.27 µs ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

how to optimize iteration on exponential in numpy

Say I have the following numpy array
n = 50
a = np.array(range(1, 1000)) / 1000.
I would like to execute this line of code
%timeit v = [a ** k for k in range(0, n)]
1000 loops, best of 3: 2.01 ms per loop
However, this line of code will ultimately be executed in a loop, therefore I have performance issues.
Is there a way to optimize the loop? For example, the result of a specific calculation i in the list comprehension is simply the result of the previous calculation result in the loop, multiplied by a again.
I don't mind storing the results in a 2d-array instead of arrays in a list. That would probably be cleaner. By the way, I also tried the following, but it yields similar performance results:
k = np.array(range(0, n))
ones = np.ones(n)
temp = np.outer(a, ones)
And then performed the following calculation
%timeit temp ** k
1000 loops, best of 3: 1.96 ms per loop
or
%timeit np.power(temp, k)
1000 loops, best of 3: 1.92 ms per loop
But both yields similar results to the list comprehension above. By the way, n will always be an integer in my case.

In quick tests cumprod seems to be faster.
In [225]: timeit v = np.array([a ** k for k in range(0, n)])
2.76 ms ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %%timeit
...: A=np.broadcast_to(a[:,None],(len(a),50))
...: v1=np.cumprod(A,axis=1)
...:
208 µs ± 42.3 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To compare values I have to tweak ranges, since v includes a 0 power, while v1 starts with a 1 power:
In [224]: np.allclose(np.array(v)[1:], v1.T[:-1])
Out[224]: True
But the timings suggest that cumprod is worth refining.
The proposed duplicate was Efficient way to compute the Vandermonde matrix. That still has good ideas.

numpy.array(list) being slow

So I have a list with 5,000,000 integers. And I want to cover the list to a numpy array. I tried following code:
numpy.array( list )
But it is very slow.
I benchmarked this operation for 100 times and loop over the list for 100 times. There is no much difference.
Any good idea how to make it faster?

If you have cython you can create a function that is definetly faster. But just a warning: It will crash if there are invalid elements inside your list (not-integers or too big integers).
I use the IPython magic here (%load_ext cython and %%cython), the point is to show how the function looks like - not to show how you can compile Cython code (it's not hard and Cythons "how-to-compile" documentation is quite good).
%load_ext cython
%%cython
cimport cython
import numpy as np
#cython.boundscheck(False)
cpdef to_array(list inp):
cdef long[:] arr = np.zeros(len(inp), dtype=long)
cdef Py_ssize_t idx
for idx in range(len(inp)):
arr[idx] = inp[idx]
return np.asarray(arr)
And the timings:
import numpy as np
def other(your_list): # the approach from #Damian Lattenero in the other answer
ret = np.zeros(shape=(len(your_list)), dtype=int)
np.copyto(ret, your_list)
return ret
inp = list(range(1000000))
%timeit np.array(inp)
# 315 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array(inp, dtype=int)
# 311 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit other(inp)
# 316 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_array(inp)
# 23.4 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it's more than 10 times faster.

I think this is fast, I checked the times:
import numpy as np
import time
start_time = time.time()
number = 1
elements = 10000000
your_list = [number] * elements
ret = np.zeros(shape=(len(your_list)))
np.copyto(ret, your_list)
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.7615997791290283 seconds ---

Make a big list of small integers; use the numpy crutch:
In [619]: arr = np.random.randint(0,256, 5000000)
In [620]: alist = arr.tolist()
In [621]: timeit alist = arr.tolist() # just for reference
10 loops, best of 3: 108 ms per loop
And time for plain list iteration (doesn't do anything)
In [622]: timeit [i for i in alist]
10 loops, best of 3: 193 ms per loop
Make an array of specified dtype
In [623]: arr8 = np.array(alist, 'uint8')
In [624]: timeit arr8 = np.array(alist, 'uint8')
1 loop, best of 3: 508 ms per loop
We can get a 2x improvement with fromiter; evidently it does less checking. np.array will work even if the list is a mix of numbers and strings. It also handles lists of lists etc.
In [625]: timeit arr81 = np.fromiter(alist, 'uint8')
1 loop, best of 3: 249 ms per loop
The advantage of working with arrays becomes apparent when we do math across the whole thing:
In [628]: timeit arr8.sum()
100 loops, best of 3: 6.93 ms per loop
In [629]: timeit sum(alist)
10 loops, best of 3: 74.4 ms per loop
In [630]: timeit 2*arr8
100 loops, best of 3: 6.89 ms per loop
In [631]: timeit [2*i for i in alist]
1 loop, best of 3: 465 ms per loop
It's well known that working with arrays is faster than with lists, but that there is a significant 'startup' overhead.