I am working on a project which includes some simple array operations on a huge array. Here is an example:
function singleoperation!(A::Array, B::Array, C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
I tried to parallelize it to get a faster speed. To parallelize it, I am using Distributed and SharedArrays, which required only a slight modification of the function I just showed:
@everywhere function paralleloperation!(A::SharedArray, B::SharedArray, C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
However, there is no time difference between the two functions, even though I am using 4 worker processes (tried on both an R7-5800X and an i7-9750H CPU). Is there anything I can improve in this code? Thanks a lot! I will post the full testing code below:
using Distributed
addprocs(4)

@everywhere begin
    using SharedArrays
    using BenchmarkTools
end
@everywhere function paralleloperation!(A::SharedArray, B::SharedArray, C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

function singleoperation!(A::Array, B::Array, C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
N = 128;
A,B,C = fill(0,N,N,N),fill(.2,N,N,N),fill(.3,N,N,N);
AN,BN,CN = SharedArray(fill(0,N,N,N)),SharedArray(fill(.2,N,N,N)),SharedArray(fill(.3,N,N,N));
@benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 1612 samples with 1 evaluation.
Range (min … max): 2.582 ms … 9.358 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.796 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.086 ms ± 790.997 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark paralleloperation!(AN,BN,CN)
BenchmarkTools.Trial: 1404 samples with 1 evaluation.
Range (min … max): 2.538 ms … 17.651 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.154 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.548 ms ± 1.238 ms ┊ GC (mean ± σ): 0.08% ± 1.65%
As the comments note, this looks like perhaps more of a job for multithreading than multiprocessing. The best approach in detail will generally depend on whether you are CPU-bound or memory-bandwidth-bound. With so simple a calculation as in the example, it may well be the latter, in which case you will reach a point of diminishing returns from adding additional threads, and may want to turn to something featuring explicit memory modelling and/or to GPUs.
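For reference, here is what that kernel looks like with Base's built-in threading (a minimal sketch, not from the original post; threadedoperation! is just an illustrative name, and Julia must be started with multiple threads, e.g. julia -t 4):

function threadedoperation!(A::Array, B::Array, C::Array)
    Threads.@threads for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

Unlike the Distributed version, this needs no SharedArrays and no data movement between worker processes.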
One very easy general-purpose approach would be to use the multithreading built into LoopVectorization.jl:
A = rand(10000,10000)
B = rand(10000,10000)
C = zeros(10000,10000)
# Base
function singleoperation!(A,B,C)
    @inbounds @simd for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
using LoopVectorization
function singleoperation_lv!(A,B,C)
    @turbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
# Multithreaded (make sure you've started Julia with multiple threads)
function threadedoperation_lv!(A,B,C)
    @tturbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
which gives us
julia> @benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.484 ms … 164.022 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.664 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.701 ms ± 118.397 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▄ ▄▄ ▁ ▁ ▁
▆▁▁▁▁▁▁▁▁▁▁▁▁▆█▆▆█▆██▁█▆▁▆█▁▁▁▆▁▁▁▁▆▁▁▁▁▁▁▁▁█▁▁▁▆▁▁▁▁▁▁▆▁▁▁▁▆ ▁
163 ms Histogram: frequency by time 164 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singleoperation_lv!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.252 ms … 163.754 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.408 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.453 ms ± 130.212 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃ ▃█▃ █ ▃ ▃
▇▁▁▁▁▁▇▁▁▁▇██▇▁▇███▇▁▁█▇█▁▁▁▁▁▁▁▁█▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇▇▁▇▁▇ ▁
163 ms Histogram: frequency by time 164 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedoperation_lv!(A,B,C)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
Range (min … max): 86.976 ms … 88.595 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 87.642 ms ┊ GC (median): 0.00%
Time (mean ± σ): 87.727 ms ± 439.427 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▂ ▂
▅▁▁▁▁██▁▅█▁█▁▁▅▅█▅█▁█▅██▅█▁██▁▅▁▁▅▅▁▅▁▅▁▅▁▅▁▅▁▁▁▅▁█▁▁█▅▁▅▁▅█ ▁
87 ms Histogram: frequency by time 88.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Now, the fact that the single-threaded LoopVectorization @turbo version is almost perfectly tied with the single-threaded @inbounds @simd version is a hint that we are probably memory-bandwidth-bound here (usually @turbo is notably faster than @inbounds @simd, so the tie suggests that the actual calculation is not the bottleneck). In that case, the multithreaded version is only helping us by getting access to a bit more memory bandwidth, and with diminishing returns, since there is presumably some main memory bus that can only go so fast regardless of how many cores talk to it.
To get a bit more insight, let's try making the arithmetic a bit harder:
function singlemoremath!(A,B,C)
    @inbounds @simd for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end

using LoopVectorization
function singlemoremath_lv!(A,B,C)
    @turbo for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end

function threadedmoremath_lv!(A,B,C)
    @tturbo for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end
then sure enough
julia> @benchmark singlemoremath!(A,B,C)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.651 s … 2.652 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.651 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.651 s ± 792.423 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.65 s Histogram: frequency by time 2.65 s <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singlemoremath_lv!(A,B,C)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 268.101 ms … 270.072 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 269.016 ms ┊ GC (median): 0.00%
Time (mean ± σ): 269.058 ms ± 467.744 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █ ▁ ▁ ▁▁█ ▁ ██ ▁ ▁ ▁ ▁ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁█▁▁███▁█▁▁██▁▁▁█▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█ ▁
268 ms Histogram: frequency by time 270 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedmoremath_lv!(A,B,C)
BenchmarkTools.Trial: 56 samples with 1 evaluation.
Range (min … max): 88.247 ms … 93.590 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 89.325 ms ┊ GC (median): 0.00%
Time (mean ± σ): 89.707 ms ± 1.200 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▁ ▄█ ▄▄▄ ▁ ▄ ▁ ▁ ▁ ▁ ▁▄
█▁█▆██▆▆▆███▆▁█▁█▁▆▁█▆▁█▆▆▁▁▆█▁▁▁▁▁▁█▆██▁▁▆▆▁▁▆▁▁▁▁▁▁▁▁▁▁▆▆ ▁
88.2 ms Histogram: frequency by time 92.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Now we're closer to CPU-bound, and threading plus SIMD vectorization make the difference between 2.6 seconds and 90 ms!
If your real problem will be as memory-bound as the example problem, you may consider working on a GPU, on a server optimized for memory bandwidth, and/or using a package that puts a lot of effort into memory modelling.
Some other packages you might check out could include Octavian.jl (CPU), Tullio.jl (CPU or GPU), and GemmKernels.jl (GPU).
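For instance, a minimal Tullio.jl sketch of the same kernel might look like this (tullio_operation! is just an illustrative name):

using Tullio

function tullio_operation!(A, B, C)
    # Tullio builds a (multithreaded) loop from the index expression;
    # using = (rather than :=) writes into the existing array C in place.
    @tullio C[k] = A[k] * B[k] / (A[k] + B[k])
end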
Related
I want to turn an array of arrays into a matrix. To illustrate, let the array of arrays be:
[ [1,2,3], [4,5,6], [7,8,9]]
I would like to turn this into the 3x3 matrix:
[1 2 3
4 5 6
7 8 9]
How would you do this in Julia?
There are several ways of doing this. For instance, something along the lines of vcat(transpose.(a)...) will work as a one-liner
julia> a = [[1,2,3], [4,5,6], [7,8,9]]
3-element Vector{Vector{Int64}}:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
julia> vcat(transpose.(a)...)
3×3 Matrix{Int64}:
1 2 3
4 5 6
7 8 9
though note that
Since your inner arrays are column vectors as written, you need to transpose them all before you can vertically concatenate (aka vcat) them (either that, or horizontally concatenate and then transpose the whole result afterward, i.e. transpose(hcat(a...)), as shown just after this list), and
The splatting operator ... which makes this one-liner work will not be very efficient when applied to Arrays in general, and especially not when applied to larger arrays-of-arrays.
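For instance, the hcat-then-transpose variant mentioned above gives the same result:

julia> transpose(hcat(a...))
3×3 transpose(::Matrix{Int64}) with eltype Int64:
 1  2  3
 4  5  6
 7  8  9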
Performance-wise for larger arrays-of-arrays, it will likely actually be hard to beat preallocating a result of the right size and then simply filling with a loop, e.g.
result = similar(first(a), length(a), length(first(a)))
for i=1:length(a)
result[i,:] = a[i] # Aside: `=` is actually slightly faster than `.=` here, though either will have the same practical result in this case
end
Some quick benchmarks for reference:
julia> using BenchmarkTools
julia> @benchmark vcat(transpose.($a)...)
BenchmarkTools.Trial: 10000 samples with 405 evaluations.
Range (min … max): 241.289 ns … 3.994 μs ┊ GC (min … max): 0.00% … 92.59%
Time (median): 262.836 ns ┊ GC (median): 0.00%
Time (mean ± σ): 289.105 ns ± 125.940 ns ┊ GC (mean ± σ): 2.06% ± 4.61%
▁▆▇█▇▆▅▅▅▄▄▄▄▃▂▂▂▃▃▂▂▁▁▁▂▄▃▁▁ ▁ ▁ ▂
████████████████████████████████▇▆▅▆▆▄▆▆▆▄▄▃▅▅▃▄▆▄▁▃▃▃▅▄▁▃▅██ █
241 ns Histogram: log(frequency) by time 534 ns <
Memory estimate: 320 bytes, allocs estimate: 5.
julia> @benchmark for i=1:length($a)
           $result[i,:] = $a[i]
       end
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
Range (min … max): 33.966 ns … 124.918 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 36.710 ns ┊ GC (median): 0.00%
Time (mean ± σ): 39.795 ns ± 7.566 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄██▄▅▃ ▅▃ ▄▁▂ ▂▁▂▅▂▁ ▄▂▁ ▂
██████████████▇██████▆█▇▆███▆▇███▇▆▆▅▆▅▅▄▄▅▄▆▆▆▄▁▃▄▁▃▄▅▅▃▁▄█ █
34 ns Histogram: log(frequency) by time 77.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
In general, filling column-by-column (if possible) will be faster than filling row-by-row as we have done here, since Julia is column-major.
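For instance, if a transposed layout (with the inner vectors as columns) is acceptable for your use case, a column-by-column fill is a minimal sketch of that idea:

result_t = similar(first(a), length(first(a)), length(a))
for i = 1:length(a)
    result_t[:, i] = a[i]  # each write is a contiguous column, which suits Julia's column-major layout
end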
Expanding on @cbk's answer, another (slightly more efficient) one-liner is
julia> transpose(reduce(hcat, a))
3×3 transpose(::Matrix{Int64}) with eltype Int64:
1 2 3
4 5 6
7 8 9
Or, constructing the matrix directly:

[1 2 3; 4 5 6; 7 8 9]
# or
reshape(1:9, 3, 3)' # remember that ' takes the transpose of a Matrix
(Pandas version 1.1.1.)
I have arrays as entries in the cells of a DataFrame column.
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
> float array
> 0 1 [1, 8]
> 1 2 [5, 14]
Now I need some statistics over each array position.
It works perfectly with the mean:
df['array'].mean()
> array([ 3., 11.])
But if I try to do the same with the maximum or the standard deviation, errors occur:
df['array'].std()
> setting an array element with a sequence.
df['array'].max()
> The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It seems like .mean(), .std() and .max() are constructed differently. Anyhow, does anyone have an idea how to calculate the std and max (and min, etc.) without splitting the array into several columns?
(The DataFrame has arrays of different shapes, but I only want to calculate statistics within a .groupby() over rows where the arrays have the same shape.)
You can convert the column to a 2d array and use numpy for the computation:
a = np.array([1,8])
b = np.array([5,14])
df = pd.DataFrame({'float':[1,2], 'array': [a,b]})
#2k for test
df = pd.concat([df] * 1000, ignore_index=True)
In [150]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
4.25 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
944 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [152]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
4.31 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [153]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
836 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For 20k rows:
df = pd.concat([df] * 10000, ignore_index=True)
In [155]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).std())
35.3 ms ± 87.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit (np.std(np.array(df['array'].tolist()), ddof=1, axis=0))
9.13 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [157]: %timeit (pd.DataFrame(df['array'].tolist(), index=df.index).max())
35.3 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (np.max(np.array(df['array'].tolist()), axis=0))
8.21 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
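Outside the timing harness, the numpy approach is just this (a minimal sketch of the same idea):

import numpy as np

arr = np.array(df['array'].tolist())  # stack the cell arrays into one 2d array
arr.std(ddof=1, axis=0)               # per-position std; ddof=1 matches pandas' default
arr.max(axis=0)                       # per-position max
arr.min(axis=0)                       # per-position min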
I have a function distance that takes a natural number as input and returns a 1-D array of length 199. My goal is to merge all the arrays distance(0), ..., distance(499). My code to do so is as follows:
import numpy as np
np.random.seed(42)
n = 200
d = 500
sample = np.random.uniform(size = [n, d])
def distance(i):
value = list(sample[i, 0:3])
temp = value - sample[(i + 1):n, 0:3]
return np.sqrt(np.sum(temp**2, axis = 1))
temp = [distance(i) for i in range(n - 1)]
result = [j for i in temp for j in i]
Because I work with a large d, I want to optimize as much as possible. I would like to ask for a faster way to merge such arrays. Thank you so much!
If you are just trying to compute the pairwise distance:
from scipy.spatial.distance import cdist
dist = cdist(sample[:,:3], sample[:,:3])
Of course you get back a symmetric array with all pairwise distances. To get your result, you can do:
result = dist[np.triu_indices(n,k=1)]
Regarding the broadcasting comment, cdist will do something similar to this:
dist = np.sum((sample[None,:,:3]-sample[:,None,:3])**2, axis=-1)**0.5
For reference, below is the run time for each:
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = [j for i in temp for j in i]
6.41 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = np.hstack(temp)
4.86 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
temp = [distance(i) for i in range(n - 1)]
result = np.concatenate(temp)
4.28 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
dist = np.sum((sample[None,:,:3]-sample[:,None,:3])**2, axis=-1)**0.5
result = dist[np.triu_indices(n,k=1)]
1.47 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
dist = cdist(sample[:,:3], sample[:,:3])
result = dist[np.triu_indices(n,k=1)]
415 µs ± 26.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In a numpy array of objects (where each object has a numeric attribute y that can be retrieved by the method get_y()), how do I obtain the index of the object with the maximum (or minimum) y attribute (without explicit looping, to save time)? If myarray were a Python list, I could use the following, but ndarray does not seem to support index. Also, numpy argmin does not seem to allow supplying a key.
minindex = myarray.index(min(myarray, key = lambda x: x.get_y()))
Some timings, comparing a numeric dtype, object dtype, and lists. Draw your own conclusions:
In [117]: x = np.arange(1000)
In [118]: xo=x.astype(object)
In [119]: np.sum(x)
Out[119]: 499500
In [120]: np.sum(xo)
Out[120]: 499500
In [121]: timeit np.sum(x)
10.8 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [122]: timeit np.sum(xo)
39.2 µs ± 673 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [123]: sum(x)
Out[123]: 499500
In [124]: timeit sum(x)
214 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [125]: timeit sum(xo)
25.3 µs ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: timeit sum(x.tolist())
29.1 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [127]: timeit sum(xo.tolist())
14.4 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [129]: %%timeit temp=x.tolist()
...: sum(temp)
6.27 µs ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
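As for the original question itself: with an object dtype, every approach ends up calling get_y() on each element anyway, so one reasonable pattern (a sketch, assuming get_y() behaves as described) is to extract the keys into a numeric array first and then use argmin/argmax:

import numpy as np

# materialize the keys into a float array, then reduce
ys = np.fromiter((obj.get_y() for obj in myarray), dtype=float, count=len(myarray))
minindex = np.argmin(ys)
maxindex = np.argmax(ys)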
I have an ncurses project where I use mvwprintw to print a long string to a window.
mvwprintw(traceview_window_flatprofile, 1, 0, "%s", flatprofile_as_str());
the result looks like this:
% self children self children
time time time calls /call /call name
39.86 886 µs 0 ns 32 27697 ns 0 ns addr_translate [13]
25.69 571 µs 1454 µs 1 571 µs 1454 µs main [0]
7.02 156 µs 0 ns 1 156 µs 0 ns addr_fini [66]
6.28 139 µs 55006 ns 1 139 µs 55006 ns addr_init [2]
3.83 85094 ns 21956 ns 2 42547 ns 10978 ns flatprofile_snprintf [43]
2.08 46150 ns 0 ns 1 46150 ns 0 ns addr_read_symbol_table [3]
When I print the same string to stderr, using
fprintf(stderr, "%s\n", flatprofile_as_str());
the result looks like:
% self children self children
time time time calls /call /call name
39.86 886 µs 0 ns 32 27697 ns 0 ns addr_translate [13]
25.69 571 µs 1454 µs 1 571 µs 1454 µs main [0]
7.02 156 µs 0 ns 1 156 µs 0 ns addr_fini [66]
6.28 139 µs 55006 ns 1 139 µs 55006 ns addr_init [2]
3.83 85094 ns 21956 ns 2 42547 ns 10978 ns flatprofile_snprintf [43]
2.08 46150 ns 0 ns 1 46150 ns 0 ns addr_read_symbol_table [3]
Do you know what could cause this difference?
EDIT: in addition to the answer below, the following question solves a related issue.
How to make ncurses display UTF-8 chars correctly in C?
The difference seems to be caused by the special character µ. I am not quite sure how you can fix it, but you will probably have to adjust your flatprofile_as_str() function.
I remember having a similar problem with special characters from UTF-8, and I solved it by using this function to count not the bytes but the actual length of a string:
int strlen_utf8(char *s) {
    int i = 0, j = 0;
    while (s[i]) {
        /* count only bytes that are not UTF-8 continuation bytes (10xxxxxx),
           i.e. one per code point rather than one per byte */
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
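For example, a quick illustrative check (assuming the function above is in scope):

#include <stdio.h>
#include <string.h>

int main(void) {
    char *s = "886 µs";  /* the µ sign takes two bytes in UTF-8 */
    printf("%zu bytes, %d characters\n", strlen(s), strlen_utf8(s));
    /* prints: 7 bytes, 6 characters */
    return 0;
}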