(Pandas version 1.1.1.)
I have arrays as entries in the cells of a DataFrame column.
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
> float array
> 0 1 [1, 8]
> 1 2 [5, 14]
Now I need some statistics over each array position.
It works perfectly with the mean:
df['array'].mean()
> array([ 3., 11.])
But if I try the same with the maximum or the standard deviation, errors occur:
df['array'].std()
> setting an array element with a sequence.
df['array'].max()
> The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It seems like .mean(), .std() and .max() are implemented differently. Anyhow, does anyone have an idea how to calculate the std and max (and min etc.) without splitting the array into several columns?
(The DataFrame has arrays of different shapes, but I only want to calculate statistics within a .groupby() over rows where the arrays have the same shape.)
You can convert the column to a 2d array and use numpy for the computation:
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
# 2k rows for testing
df = pd.concat([df] * 1000, ignore_index=True)
In [150]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
4.25 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
944 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [152]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
4.31 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [153]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
836 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For 20k rows:
df = pd.concat([df] * 10000, ignore_index=True)
In [155]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
35.3 ms ± 87.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
9.13 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [157]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
35.3 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
8.21 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
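To address the .groupby() part of the question: the same idea should work per group. A rough sketch, assuming a hypothetical grouping column 'key' whose groups each contain arrays of equal length:
# 'key' is a made-up column; within each group the arrays must share a shape
stats = df.groupby('key')['array'].apply(lambda s: np.std(np.array(s.tolist()), ddof=1, axis=0))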
For a numpy array
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
You can get a slice using something like a[3:6]
But what about getting the rest of the slice? What is the most computationally efficient method for this? So something like a[:3, 6:].
The best I can come up with is to use concatenate:
np.concatenate([a[:3], a[6:]], axis=0)
I am wondering if this is the best method, as I will be doing millions of these operations for a data processing pipeline.
Your solution seems to be the most efficient one since it is more than 2x faster than the next best thing.
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
%timeit -n 100000 np.concatenate([a[:3], a[6:]], axis=0)
%timeit -n 100000 np.delete(a, slice(3, 6))
%timeit -n 100000 a[np.r_[:3,6:9]]
>2.03 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>4.61 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>11 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
However, the real question is whether these operations (complement of a slice / deletion) need to be applied consecutively. Otherwise, you could aggregate the indices via set operations and slice the complement a single time at the end to obtain the proper NumPy array, as sketched below.
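A minimal illustration of that idea (the index ranges here are made up for the example):
# collect all the ranges you want to drop, build the complement once, index once
drop = set(range(3, 6)) | set(range(7, 8))
keep = np.array(sorted(set(range(len(a))) - drop))
result = a[keep]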
I find that declaring an empty array and filling it up is very slightly better than using concatenate. As André mentioned in their comment, this will vary based on the shape.
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
def testing123():
    new = np.zeros(6, dtype=int)
    new[:3] = a[0:3]
    new[3:] = a[6:]
    return new
%timeit -n 100000 np.concatenate([a[:3], a[6:]], axis=0)
100000 loops, best of 5: 2.18 µs per loop
%timeit -n 100000 np.delete(a, slice(3, 6))
100000 loops, best of 5: 6.11 µs per loop
%timeit -n 100000 a[np.r_[:3,6:9]]
100000 loops, best of 5: 16.4 µs per loop
%timeit -n 100000 testing123()
100000 loops, best of 5: 2.01 µs per loop
a = np.arange(10_000)
def testing123():
    new = np.empty(5000, dtype=int)
    new[:2500] = a[:2500]
    new[2500:] = a[7500:]
    return new
%timeit -n 100000 np.concatenate([a[:2500], a[7500:]], axis=0)
100000 loops, best of 5: 3.99 µs per loop
%timeit -n 100000 np.delete(a, slice(2500, 7500))
100000 loops, best of 5: 7.76 µs per loop
%timeit -n 100000 a[np.r_[:2500,7500:10_000]]
100000 loops, best of 5: 47.3 µs per loop
%timeit -n 100000 testing123()
100000 loops, best of 5: 3.61 µs per loop
The following function applies numpy functions to two numpy arrays.
import numpy as np
def my_func(a: np.ndarray, b: np.ndarray) -> float:
    return np.nanmin(a, axis=0) + np.nanmin(b, axis=0)
>>> my_func(np.array([1., 2., np.nan]), np.array([1., np.nan]))
2.0
However, what is the best way to apply this same function to an np.array of np.arrays of different shapes?
a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object) # First array shape (2,), second (4,)
b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)
np.vectorize does work
>>> np.vectorize(my_func)(a, b)
array([2. , 2.5])
but as specified by the vectorize documentation:
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Is there a more clever solution?
I could use np.pad to get identical shapes, but that seems sub-optimal as it requires padding up to the maximum length of the inner arrays (here 4 for a and 3 for b).
I looked at numba and this Stack Exchange post about performance, but I am not sure of the best practice for such a case.
Thanks!
Your function and arrays:
In [222]: def my_func(a: np.ndarray, b: np.ndarray) -> float:
...: return np.nanmin(a, axis=0) + np.nanmin(b, axis=0)
...:
In [223]: a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object
...: ) # First array shape (2,), second (4,)
...: b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)
In [224]: a
Out[224]: array([array([1., 2.]), array([ 1., 2., 3., nan])], dtype=object)
In [225]: b
Out[225]: array([array([1]), array([1.5, 2.5, nan])], dtype=object)
Compare vectorize with a straightforward list comprehension:
In [226]: np.vectorize(my_func)(a, b)
Out[226]: array([2. , 2.5])
In [227]: [my_func(i,j) for i,j in zip(a,b)]
Out[227]: [2.0, 2.5]
and their times:
In [228]: timeit np.vectorize(my_func)(a, b)
157 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [229]: timeit [my_func(i,j) for i,j in zip(a,b)]
85.9 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [230]: timeit np.array([my_func(i,j) for i,j in zip(a,b)])
89.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you are going to work with object arrays, frompyfunc is faster than vectorize:
In [231]: np.frompyfunc(my_func,2,1)(a, b)
Out[231]: array([2.0, 2.5], dtype=object)
In [232]: timeit np.frompyfunc(my_func,2,1)(a, b)
83.2 µs ± 50.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I'm a bit surprised that it's even better than the list comprehension.
frompyfunc (and vectorize) are more useful when the inputs need to 'broadcast' against each other:
In [233]: np.frompyfunc(my_func,2,1)(a[:,None], b)
Out[233]:
array([[2.0, 2.5],
[2.0, 2.5]], dtype=object)
I'm not a numba expert, but I suspect it doesn't handle object dtype arrays, or if it does, it doesn't improve speed much. Remember, object dtype means the elements are object references, just like in lists.
I get better times by using otypes and taking the function creation out of the timing loop:
In [235]: %%timeit f=np.vectorize(my_func, otypes=[float])
...: f(a, b)
...:
...:
95.5 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [236]: %%timeit f=np.frompyfunc(my_func,2,1)
...: f(a, b)
...:
...:
81.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you don't know about otypes, you haven't read the np.vectorize docs well enough.
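Putting that together, a small sketch of the pattern I'd reach for (build the wrapper once, reuse it, and cast the object-dtype result if a float array is needed):
f = np.frompyfunc(my_func, 2, 1)
out = f(a, b).astype(float)   # array([2. , 2.5])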
I have a function distance that takes a natural number as input and returns a 1-D array of length 199. My goal is to merge all the arrays distance(0), ..., distance(499). My code to do so is as follows:
import numpy as np
np.random.seed(42)
n = 200
d = 500
sample = np.random.uniform(size = [n, d])
def distance(i):
    value = list(sample[i, 0:3])
    temp = value - sample[(i + 1):n, 0:3]
    return np.sqrt(np.sum(temp**2, axis = 1))
temp = [distance(i) for i in range(n - 1)]
result = [j for i in temp for j in i]
Because I work with large d, I want to optimize as well as possible. I would like to ask for a faster way to merge such arrays.
Thank you so much!
If you are just trying to compute the pairwise distance:
from scipy.spatial.distance import cdist
dist = cdist(sample[:,:3], sample[:,:3])
Of course you get back a symmetric array with all pairwise distances. To get your result, you can do:
result = dist[np.triu_indices(n,k=1)]
Regarding the broadcasting comment, cdist will do something similar to this:
dist = np.sum((sample[None,:,:3]-sample[:,None,:3])**2, axis=-1)**0.5
For reference, below is the run time for each:
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = [j for i in temp for j in i]
6.41 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = np.hstack(temp)
4.86 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = np.concatenate(temp)
4.28 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
dist = np.sum((sample[None,:,:3]-sample[:,None,:3])**2, axis=-1)**0.5
result = dist[np.triu_indices(n,k=1)]
1.47 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
dist = cdist(sample[:,:3], sample[:,:3])
result = dist[np.triu_indices(n,k=1)]
415 µs ± 26.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
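As a possible further shortcut (not among the timings above), scipy's pdist returns the condensed upper-triangle distances directly, which should match the triu_indices result while skipping the full symmetric matrix:
from scipy.spatial.distance import pdist
result = pdist(sample[:, :3])   # same values as dist[np.triu_indices(n, k=1)]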
In a numpy array of objects (where each object has a numeric attribute y that can be retrieved by the method get_y()), how do I obtain the index of the object with the maximum (or minimum) y attribute (without explicit looping, to save time)? If myarray were a python list, I could use the following, but ndarray does not seem to support .index(). Also, numpy's argmin does not seem to allow supplying a key.
minindex = myarray.index(min(myarray, key = lambda x: x.get_y()))
Some timings, comparing a numeric dtype, object dtype, and lists. Draw your own conclusions:
In [117]: x = np.arange(1000)
In [118]: xo=x.astype(object)
In [119]: np.sum(x)
Out[119]: 499500
In [120]: np.sum(xo)
Out[120]: 499500
In [121]: timeit np.sum(x)
10.8 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [122]: timeit np.sum(xo)
39.2 µs ± 673 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [123]: sum(x)
Out[123]: 499500
In [124]: timeit sum(x)
214 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [125]: timeit sum(xo)
25.3 µs ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: timeit sum(x.tolist())
29.1 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [127]: timeit sum(xo.tolist())
14.4 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [129]: %%timeit temp=x.tolist()
...: sum(temp)
6.27 µs ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
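As for the original question, one common pattern (a sketch, not part of the timings above) is to extract the attribute into a numeric array first and then let argmin/argmax do the work:
# the comprehension still loops, but only to pull out y; the search itself is vectorized
ys = np.array([obj.get_y() for obj in myarray])
minindex = ys.argmin()
maxindex = ys.argmax()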
So I have a list with 5,000,000 integers, and I want to convert the list to a numpy array. I tried the following code:
numpy.array( list )
But it is very slow.
I benchmarked this operation 100 times and also looped over the list 100 times; there is not much difference.
Any good ideas on how to make it faster?
If you have Cython you can create a function that is definitely faster. But just a warning: it will crash if there are invalid elements inside your list (non-integers or too-big integers).
I use the IPython magics here (%load_ext cython and %%cython); the point is to show what the function looks like, not how to compile Cython code (it's not hard, and Cython's "how-to-compile" documentation is quite good).
%load_ext cython
%%cython
cimport cython
import numpy as np
@cython.boundscheck(False)
cpdef to_array(list inp):
    cdef long[:] arr = np.zeros(len(inp), dtype=long)
    cdef Py_ssize_t idx
    for idx in range(len(inp)):
        arr[idx] = inp[idx]
    return np.asarray(arr)
And the timings:
import numpy as np
def other(your_list):  # the approach from Damian Lattenero in the other answer
    ret = np.zeros(shape=(len(your_list)), dtype=int)
    np.copyto(ret, your_list)
    return ret
inp = list(range(1000000))
%timeit np.array(inp)
# 315 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array(inp, dtype=int)
# 311 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit other(inp)
# 316 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_array(inp)
# 23.4 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it's more than 10 times faster.
I think this is fast; I checked the times:
import numpy as np
import time
start_time = time.time()
number = 1
elements = 10000000
your_list = [number] * elements
ret = np.zeros(shape=(len(your_list)))
np.copyto(ret, your_list)
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.7615997791290283 seconds ---
Make a big list of small integers; use the numpy crutch:
In [619]: arr = np.random.randint(0,256, 5000000)
In [620]: alist = arr.tolist()
In [621]: timeit alist = arr.tolist() # just for reference
10 loops, best of 3: 108 ms per loop
And time for plain list iteration (doesn't do anything)
In [622]: timeit [i for i in alist]
10 loops, best of 3: 193 ms per loop
Make an array of specified dtype
In [623]: arr8 = np.array(alist, 'uint8')
In [624]: timeit arr8 = np.array(alist, 'uint8')
1 loop, best of 3: 508 ms per loop
We can get a 2x improvement with fromiter; evidently it does less checking. np.array will work even if the list is a mix of numbers and strings. It also handles lists of lists etc.
In [625]: timeit arr81 = np.fromiter(alist, 'uint8')
1 loop, best of 3: 249 ms per loop
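A small tweak that may be worth trying (not timed here): passing the length via count lets fromiter preallocate the output instead of growing it as it goes.
arr81 = np.fromiter(alist, dtype='uint8', count=len(alist))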
The advantage of working with arrays becomes apparent when we do math across the whole thing:
In [628]: timeit arr8.sum()
100 loops, best of 3: 6.93 ms per loop
In [629]: timeit sum(alist)
10 loops, best of 3: 74.4 ms per loop
In [630]: timeit 2*arr8
100 loops, best of 3: 6.89 ms per loop
In [631]: timeit [2*i for i in alist]
1 loop, best of 3: 465 ms per loop
It's well known that working with arrays is faster than working with lists, but there is a significant 'startup' overhead.
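A rough illustration of that trade-off: for a large list the one-time conversion cost is quickly amortized by fast whole-array math.
arr = np.array(alist, dtype='uint8')   # pay the conversion ('startup') cost once
total = arr.sum()                      # subsequent array operations are cheap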