I want to turn an array of arrays into a matrix. To illustrate, let the array of arrays be:
[[1,2,3], [4,5,6], [7,8,9]]
I would like to turn this into the 3x3 matrix:
[1 2 3
4 5 6
7 8 9]
How would you do this in Julia?
There are several ways of doing this. For instance, something along the lines of vcat(transpose.(a)...) will work as a one-liner:
julia> a = [[1,2,3], [4,5,6], [7,8,9]]
3-element Vector{Vector{Int64}}:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
julia> vcat(transpose.(a)...)
3×3 Matrix{Int64}:
1 2 3
4 5 6
7 8 9
though note that:
1. since your inner arrays are column vectors as written, you need to transpose them all before you can vertically concatenate (aka vcat) them (either that, or horizontally concatenate and then transpose the whole result after, i.e., transpose(hcat(a...))), and
2. the splatting operator ... which makes this one-liner work will not be very efficient when applied to Arrays in general, and especially not when applied to larger arrays-of-arrays (see the splat-free sketch below).
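As a splat-free sketch of the same idea: reduce has specialized, efficient methods for vcat and hcat, so you can write

reduce(vcat, permutedims.(a))  # 3×3 Matrix{Int64}; permutedims turns each inner vector into a 1×3 row

which gives the same result without splatting.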
Performance-wise, for larger arrays-of-arrays it will likely be hard to beat preallocating a result of the right size and then simply filling it with a loop, e.g.
result = similar(first(a), length(a), length(first(a)))
for i=1:length(a)
result[i,:] = a[i] # Aside: `=` is actually slightly faster than `.=` here, though either will have the same practical result in this case
end
Some quick benchmarks for reference:
julia> using BenchmarkTools
julia> @benchmark vcat(transpose.($a)...)
BenchmarkTools.Trial: 10000 samples with 405 evaluations.
Range (min … max): 241.289 ns … 3.994 μs ┊ GC (min … max): 0.00% … 92.59%
Time (median): 262.836 ns ┊ GC (median): 0.00%
Time (mean ± σ): 289.105 ns ± 125.940 ns ┊ GC (mean ± σ): 2.06% ± 4.61%
Memory estimate: 320 bytes, allocs estimate: 5.
julia> @benchmark for i=1:length($a)
$result[i,:] = $a[i]
end
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
Range (min … max): 33.966 ns … 124.918 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 36.710 ns ┊ GC (median): 0.00%
Time (mean ± σ): 39.795 ns ± 7.566 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
In general, filling column-by-column (if possible) will be faster than filling row-by-row as we have done here, since Julia is column-major.
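For illustration, a column-major-friendly sketch of the same fill: write into the columns of the transposed layout, then take a lazy transpose at the end so the result is indexed the same way as before.

resultT = similar(first(a), length(first(a)), length(a))
for j in eachindex(a)
    resultT[:, j] = a[j]  # contiguous column writes
end
result = transpose(resultT)  # lazy wrapper; result[i, :] == a[i]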
Expanding on @cbk's answer, another (slightly more efficient) one-liner is
julia> transpose(reduce(hcat, a))
3×3 transpose(::Matrix{Int64}) with eltype Int64:
1 2 3
4 5 6
7 8 9
[1 2 3; 4 5 6; 7 8 9]
# or
reshape(1:9, 3, 3)' # remember that ' makes the transpose of a Matrix
Related
The following function applies NumPy functions to two NumPy arrays.
import numpy as np
def my_func(a: np.ndarray, b: np.ndarray) -> float:
return np.nanmin(a, axis=0) + np.nanmin(b, axis=0)
>>> my_func(np.array([1., 2., np.nan]), np.array([1., np.nan]))
2.0
However, what is the best way to apply this same function to an np.array of np.arrays of different shapes?
a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object) # First array has shape (2,), second (4,)
b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)
np.vectorize does work
>>> np.vectorize(my_func)(a, b)
array([2. , 2.5])
but as specified by the vectorize documentation:
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Is there a more clever solution?
I could use np.pad to get identical shapes, but that seems sub-optimal, as it requires padding up to the maximum length of the inner arrays (here 4 for a and 3 for b).
I looked at numba and this Stack Exchange post about performance, but I am not sure of the best practice for such a case.
Thanks!
Your function and arrays:
In [222]: def my_func(a: np.ndarray, b: np.ndarray) -> float:
...: return np.nanmin(a, axis=0) + np.nanmin(b, axis=0)
...:
In [223]: a = np.array([np.array([1., 2]), np.array([1, 2., 3, np.nan])], dtype=object
...: ) # First array shape (2,), second (3,)
...: b = np.array([np.array([1]), np.array([1.5, 2.5, np.nan])], dtype=object)
In [224]: a
Out[224]: array([array([1., 2.]), array([ 1., 2., 3., nan])], dtype=object)
In [225]: b
Out[225]: array([array([1]), array([1.5, 2.5, nan])], dtype=object)
Compare vectorize with a straightforward list comprehension:
In [226]: np.vectorize(my_func)(a, b)
Out[226]: array([2. , 2.5])
In [227]: [my_func(i,j) for i,j in zip(a,b)]
Out[227]: [2.0, 2.5]
and their times:
In [228]: timeit np.vectorize(my_func)(a, b)
157 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [229]: timeit [my_func(i,j) for i,j in zip(a,b)]
85.9 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [230]: timeit np.array([my_func(i,j) for i,j in zip(a,b)])
89.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you are going to work with object arrays, frompyfunc is faster than vectorize:
In [231]: np.frompyfunc(my_func,2,1)(a, b)
Out[231]: array([2.0, 2.5], dtype=object)
In [232]: timeit np.frompyfunc(my_func,2,1)(a, b)
83.2 µs ± 50.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I'm a bit surprised that it's even better than the list comprehension.
frompyfunc (and vectorize) are more useful when the inputs need to 'broadcast' against each other:
In [233]: np.frompyfunc(my_func,2,1)(a[:,None], b)
Out[233]:
array([[2.0, 2.5],
[2.0, 2.5]], dtype=object)
I'm not a numba expert, but I suspect it doesn't handle object dtype arrays, or if it does, it doesn't improve speed much. Remember, object dtype means the elements are object references, just like in lists.
I get better times by using otypes and taking the function creation out of the timing loop:
In [235]: %%timeit f=np.vectorize(my_func, otypes=[float])
...: f(a, b)
...:
...:
95.5 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [236]: %%timeit f=np.frompyfunc(my_func,2,1)
...: f(a, b)
...:
...:
81.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you don't know about otypes, you haven't read the np.vectorize docs well enough.
I am working on a project which includes some simple array operations on a huge array. Here is an example:
function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
To get more speed, I tried to parallelize it using Distributed and SharedArrays, modifying the function above only slightly:
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
However, there is no time difference between the two functions, even though I am using 4 workers (tested on R7-5800X and i7-9750H CPUs). Is there anything I can improve in this code? Thanks a lot! The full testing code is below:
using Distributed
addprocs(4)
@everywhere begin
using SharedArrays
using BenchmarkTools
end
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
N = 128;
A,B,C = fill(0,N,N,N),fill(.2,N,N,N),fill(.3,N,N,N);
AN,BN,CN = SharedArray(fill(0,N,N,N)),SharedArray(fill(.2,N,N,N)),SharedArray(fill(.3,N,N,N));
@benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 1612 samples with 1 evaluation.
Range (min … max): 2.582 ms … 9.358 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.796 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.086 ms ± 790.997 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark paralleloperation!(AN,BN,CN)
BenchmarkTools.Trial: 1404 samples with 1 evaluation.
Range (min … max): 2.538 ms … 17.651 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.154 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.548 ms ± 1.238 ms ┊ GC (mean ± σ): 0.08% ± 1.65%
As the comments note, this looks like perhaps more of a job for multithreading than multiprocessing. The best approach will generally depend on whether you are CPU-bound or memory-bandwidth-bound. With a calculation as simple as the one in the example, it may well be the latter, in which case you will reach a point of diminishing returns from adding threads, and may want to turn to something featuring explicit memory modelling, and/or to GPUs.
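As an aside, the most minimal multithreaded version needs nothing beyond Base. A sketch, assuming Julia was started with several threads (e.g. julia -t 4):

function threadedoperation!(A, B, C)
    Threads.@threads for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end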
However, one very easy general-purpose approach would be to use the multithreading built into LoopVectorization.jl:
A = rand(10000,10000)
B = rand(10000,10000)
C = zeros(10000,10000)
# Base
function singleoperation!(A,B,C)
@inbounds @simd for k in eachindex(A)
C[k] = A[k] * B[k] / (A[k] + B[k])
end
end
using LoopVectorization
function singleoperation_lv!(A,B,C)
@turbo for k in eachindex(A)
C[k] = A[k] * B[k] / (A[k] + B[k])
end
end
# Multithreaded (make sure you've started Julia with multiple threads)
function threadedoperation_lv!(A,B,C)
@tturbo for k in eachindex(A)
C[k] = A[k] * B[k] / (A[k] + B[k])
end
end
which gives us
julia> @benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.484 ms … 164.022 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.664 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.701 ms ± 118.397 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singleoperation_lv!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.252 ms … 163.754 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.408 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.453 ms ± 130.212 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedoperation_lv!(A,B,C)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
Range (min … max): 86.976 ms … 88.595 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 87.642 ms ┊ GC (median): 0.00%
Time (mean ± σ): 87.727 ms ± 439.427 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
Now, the fact that the single-threaded LoopVectorization @turbo version is almost perfectly tied with the single-threaded @inbounds @simd version is to me a hint that we are probably memory-bandwidth-bound here (usually @turbo is notably faster than @inbounds @simd, so the tie suggests the actual calculation is not the bottleneck). In that case the multithreaded version is only helping by getting us access to a bit more memory bandwidth, though with diminishing returns, since there is presumably some main-memory bus that can only go so fast regardless of how many cores it talks to.
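A rough back-of-the-envelope check supports this (my arithmetic, not an additional measurement): the kernel reads A and B and writes C, so each call moves about 3 × 10^8 Float64s.

bytes_moved = 3 * 10_000 * 10_000 * 8   # ≈ 2.4 GB per call
bytes_moved / 0.1637 / 1e9              # ≈ 14.7 GB/s at the 163.7 ms single-threaded time

That is in the right ballpark for a single core's share of main-memory bandwidth, consistent with being memory-bound.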
To get a bit more insight, let's try making the arithmetic a bit harder:
function singlemoremath!(A,B,C)
@inbounds @simd for k in eachindex(A)
C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
end
end
using LoopVectorization
function singlemoremath_lv!(A,B,C)
@turbo for k in eachindex(A)
C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
end
end
function threadedmoremath_lv!(A,B,C)
@tturbo for k in eachindex(A)
C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
end
end
then sure enough
julia> @benchmark singlemoremath!(A,B,C)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.651 s … 2.652 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.651 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.651 s ± 792.423 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singlemoremath_lv!(A,B,C)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 268.101 ms … 270.072 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 269.016 ms ┊ GC (median): 0.00%
Time (mean ± σ): 269.058 ms ± 467.744 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedmoremath_lv!(A,B,C)
BenchmarkTools.Trial: 56 samples with 1 evaluation.
Range (min … max): 88.247 ms … 93.590 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 89.325 ms ┊ GC (median): 0.00%
Time (mean ± σ): 89.707 ms ± 1.200 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
now we're closer to CPU-bound, and threading plus SIMD vectorization make the difference between 2.6 seconds and 90 ms!
If your real problem is going to be as memory-bound as the example problem, you may consider working on a GPU, on a server optimized for memory bandwidth, and/or using a package that puts a lot of effort into memory modelling.
Some other packages you might check out could include Octavian.jl (CPU), Tullio.jl (CPU or GPU), and GemmKernels.jl (GPU).
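For instance, a minimal sketch of the same elementwise kernel written with Tullio.jl's @tullio macro (which multithreads automatically for sufficiently large arrays):

using Tullio
function tulliooperation!(A, B, C)
    @tullio C[i, j] = A[i, j] * B[i, j] / (A[i, j] + B[i, j])
end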
(Pandas version 1.1.1.)
I have arrays as entries in the cells of a Dataframe column.
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
> float array
> 0 1 [1, 8]
> 1 2 [5, 14]
Now I need some statistics over each array position.
It works perfectly with the mean:
df['array'].mean()
> array([ 3., 11.])
But if I try to do it with the maximum or the standard deviation, errors occur:
df['array'].std()
> setting an array element with a sequence.
df['array'].max()
> The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It seems like .mean(), .std(), and .max() are implemented differently. Anyhow, does someone have an idea how to calculate the std and max (and min, etc.) without splitting the array into several columns?
(The DataFrame has arrays of different shapes, but I only want to calculate statistics within a .groupby() over rows where the arrays have the same shape.)
You can convert the column to a 2d array and use numpy for the computation:
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
# 2k rows for testing
df = pd.concat([df] * 1000, ignore_index=True)
In [150]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
4.25 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
944 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [152]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
4.31 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [153]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
836 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For 20k rows:
df = pd.concat([df] * 10000, ignore_index=True)
In [155]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
35.3 ms ± 87.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
9.13 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [157]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
35.3 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
8.21 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Basically I have two 1d numpy arrays, let's call them x and y, both of the same length. I want to essentially get the result x1*y1 + x2*y2 + ... + xn*yn. Obviously I could do this with a for loop, but is there a built-in method or something where I can do this in one line?
What you are trying to compute is known as an 'inner product' and, in the case of two vectors, is called a 'dot product'. Numpy has built-in functions for computing both which are optimized for speed over the simple (x*y).sum() solution.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
print(np.inner(a, b))
# 10
print(np.dot(a, b))
# 10
Some timing results in the table below, with vectors a and b each holding 1000 elements drawn using np.random.randn:
np.dot(a, b) # 920 ns ± 9.9 ns
np.inner(a, b) # 1.1 µs ± 83.5 ns
(a*b).sum() # 4.2 µs ± 62.9 ns
np.sum(a*b) # 5.7 µs ± 170 ns
You can use sum(x*y) or (x*y).sum(); they're equivalent.
Is it possible to interleave two arrays in Julia?
For example, if a=[1:10] and b=[11:20], I want to be able to return
20-element Array{Int64,1}:
1
11
2
12
3
13
4
14
.
.
.
Somewhat similar to what Ruby can do (see "Merge and interleave two arrays in Ruby").
There is a straightforward way to do this without needing to use the reshape() function. In particular, we can just bind the vectors into a matrix and then use [:] on the transpose of that matrix. For example:
julia> a = 1:10
julia> b = 11:20
julia> [a b]'[:]
20-element Array{Int64,1}:
1
11
2
12
3
13
.
.
.
20
Taking the transpose of the matrix [a b] gives us a 2-by-10 matrix, and then [:] returns all of its elements in the form of a vector. The reason [:] works so nicely for us is because Julia uses column-major ordering.
Figured it out!
reshape([a b]',20,1)
and for something more general:
reshape([a b].',size(a,1)+size(b,1),1)
we can use a hack to get a vector instead of an N×1 two-dimensional array:
reshape([a b].',size(a,1)+size(b,1),1)[:]
You could just use
reshape([a b].', length(a)+length(b))
to get a vector.
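Note that the .' transpose syntax was removed in Julia 1.0; a modern equivalent (a sketch with the same a and b) is

vec(permutedims([a b]))  # 20-element Vector, interleaved

since permutedims materializes the 2×10 transpose and vec flattens it in column-major order.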
julia> @benchmark collect(Iterators.flatten(zip(a,b))) setup = begin a=rand(100); b=rand(100) end
BenchmarkTools.Trial: 10000 samples with 714 evaluations.
Range (min … max): 190.895 ns … 1.548 μs ┊ GC (min … max): 0.00% … 65.85%
Time (median): 238.843 ns ┊ GC (median): 0.00%
Time (mean ± σ): 265.428 ns ± 148.757 ns ┊ GC (mean ± σ): 8.40% ± 11.75%
Memory estimate: 1.77 KiB, allocs estimate: 1.
it seems that
collect(Iterators.flatten(zip(a,b)))
is much faster
For completeness, expanding @bdeonovic's solution to 2-dimensional arrays.
julia> a
2×2 Array{Int64,2}:
1 2
3 4
julia> b
2×2 Array{Int64,2}:
6 7
8 9
Interweaving rows:
julia> reshape([a[:] b[:]]', 4, 2)
4×2 Array{Int64,2}:
1 2
6 7
3 4
8 9
Interweaving columns:
julia> reshape( [a' b']', 2, 4 )
2×4 Array{Int64,2}:
1 6 2 7
3 8 4 9
Interweaving arrays (stacking/vcatting):
julia> reshape([a' b']', 4, 2)
4×2 Array{Int64,2}:
1 2
3 4
6 7
8 9
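For the record, this last case is just vertical concatenation, so with the same a and b it can be written directly as

vcat(a, b)  # equivalently [a; b]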