Performance assigning and copying with StaticArrays.jl in Julia - arrays

I was thinking of using the package StaticArrays.jl to improve the performance of my code. However, I only use arrays to store computed values and read them back later, once certain conditions are met. I therefore benchmarked the type SizedVector against a normal Vector, but I do not understand the results of the code below. I also tried StaticVector, using Setfield.jl as a workaround.
using StaticArrays, BenchmarkTools, Setfield
function copySized(n::Int64)
v = SizedVector{n, Int64}(zeros(n))
w = Vector{Int64}(undef, n)
for i in eachindex(v)
v[i] = i
end
for i in eachindex(v)
w[i] = v[i]
end
end
function copyStatic(n::Int64)
v = @SVector zeros(n)
w = Vector{Int64}(undef, n)
for i in eachindex(v)
@set v[i] = i
end
for i in eachindex(v)
w[i] = v[i]
end
end
function copynormal(n::Int64)
v = zeros(n)
w = Vector{Int64}(undef, n)
for i in eachindex(v)
v[i] = i
end
for i in eachindex(v)
w[i] = v[i]
end
end
n = 10
@btime copySized($n)
@btime copyStatic($n)
@btime copynormal($n)
3.950 μs (42 allocations: 2.08 KiB)
5.417 μs (98 allocations: 4.64 KiB)
78.822 ns (2 allocations: 288 bytes)
Why does the SizedVector version have so many more allocations, and hence worse performance? Am I not using SizedVector correctly? Shouldn't it have at least the same performance as a normal array?
Thank you in advance.
Cross post of Julia Discourse

I feel this is an apples-to-oranges comparison (and the size should be stored statically in the type). More illustrative code could look like this:
function copySized(::Val{n}) where n
v = SizedVector{n}(1:n)
w = Vector{Int64}(undef, n)
w .= v
end
function copyStatic(::Val{n}) where n
v = SVector{n}(1:n)
w = Vector{Int64}(undef, n)
w .= v
end
function copynormal(n)
v = [1:n;]
w = Vector{Int64}(undef, n)
w .= v
end
And now the benchmarks:
julia> n = 10
10
julia> @btime copySized(Val{$n}());
248.138 ns (1 allocation: 144 bytes)
julia> @btime copyStatic(Val{$n}());
251.507 ns (1 allocation: 144 bytes)
julia> @btime copynormal($n);
77.940 ns (2 allocations: 288 bytes)
julia> n = 1000
1000
julia> @btime copySized(Val{$n}());
840.000 ns (2 allocations: 7.95 KiB)
julia> @btime copyStatic(Val{$n}());
830.769 ns (2 allocations: 7.95 KiB)
julia> @btime copynormal($n);
1.100 μs (2 allocations: 15.88 KiB)

@phipsgabler is right! Statically sized arrays have a performance advantage when the size is known statically, at compile time. My arrays were, however, dynamically sized, with the size n being a runtime variable.
Changing this yields more sensible results:
using StaticArrays, BenchmarkTools, Setfield
function copySized()
v = SizedVector{10, Float64}(zeros(10))
w = Vector{Float64}(undef, 10*2)
for i in eachindex(v)
v[i] = rand()
end
for i in eachindex(v)
j = i+floor(Int64, 10/4)
w[j] = v[i]
end
end
function copyStatic()
v = @SVector zeros(10)
w = Vector{Int64}(undef, 10*2)
for i in eachindex(v)
@set v[i] = rand()
end
for i in eachindex(v)
j = i+floor(Int64, 10/4)
w[j] = v[i]
end
end
function copynormal()
v = zeros(10)
w = Vector{Float64}(undef, 10*2)
for i in eachindex(v)
v[i] = rand()
end
for i in eachindex(v)
j = i+floor(Int64, 10/4)
w[j] = v[i]
end
end
@btime copySized()
@btime copyStatic()
@btime copynormal()
110.162 ns (3 allocations: 512 bytes)
48.133 ns (1 allocation: 224 bytes)
92.045 ns (2 allocations: 368 bytes)
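The difference between a size known at compile time and a size known only at run time can be illustrated without StaticArrays at all. The following is a minimal sketch using only Base Julia; `squares_static` and `squares_dynamic` are hypothetical names, and a tuple stands in for a static vector here.

```julia
# Hypothetical helper: the length N is a type parameter, so the compiler
# knows it and can specialize the method for each N.
squares_static(::Val{N}) where {N} = ntuple(i -> i^2, Val(N))

# Here `n` is an ordinary runtime value. Val(n) turns the value into a type,
# which may require a dynamic dispatch whenever `n` is not a compile-time constant.
squares_dynamic(n::Int) = squares_static(Val(n))

t = squares_static(Val(4))   # size fixed in the type: NTuple{4, Int}
u = squares_dynamic(4)       # same result, size passed as a runtime value
```

Both calls return the same tuple; the point is only where the size lives. This is presumably why hard-coding the length 10 in the functions above behaves differently from passing `n` as an argument.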

Related

How to count matches in two arrays?

If I have two arrays, how can I count the number of matching elements?
E.g. with
x = [1,2,3,4,5]
y = [3,4,5,6]
I'd like to get the count (3) of the three matching elements 3,4,and 5.
You can use intersect:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
3
4
5
6
julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
5
4
3
julia> length(intersect(Set(x), Set(y)))
3
The following algorithm can be nearly 4x faster than Set intersection. The idea is to sort the arrays first, which has O(n log n) complexity for each array, and then merge-compare the sorted versions for equal elements, which has O(m + n) linear complexity. The overall complexity is therefore O(n log n).
This algorithm counts duplicate elements as separate matches, but it can be modified with a small overhead to behave like sets: add a variable that keeps track of the last matched element, and increment the number of matches only when a new, different pair is matched.
function count_matches(x,y)
sort!(x) # or x = sort(x)
sort!(y) # or y = sort(y)
i = j = 1
matches = 0
while i <= length(x) && j <= length(y)
if x[i] == y[j]
i += 1
j += 1
matches += 1
elseif x[i] < y[j]
i += 1
else
j += 1
end
end
matches
end
Comparing with:
function count_matches0(x,y)
length(intersect(Set(x), Set(y)))
end
and timing with n = 10000 arrays, we get:
@btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
@btime count_matches(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
246.700 μs (31 allocations: 338.31 KiB)
63.200 μs (2 allocations: 15.88 KiB)
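The set-like modification described above (counting each distinct matched value only once) could be sketched as follows. `count_unique_matches` and `last_matched` are hypothetical names, and this version sorts copies rather than mutating its arguments; it has not been benchmarked.

```julia
# Sketch: merge-compare two sorted copies, counting each distinct value once.
function count_unique_matches(x, y)
    xs, ys = sort(x), sort(y)      # sorted copies; inputs stay untouched
    i = j = 1
    matches = 0
    last_matched = nothing         # value of the most recent counted match
    while i <= length(xs) && j <= length(ys)
        if xs[i] == ys[j]
            # Count only if this value differs from the last counted one.
            if last_matched === nothing || xs[i] != last_matched
                matches += 1
                last_matched = xs[i]
            end
            i += 1
            j += 1
        elseif xs[i] < ys[j]
            i += 1
        else
            j += 1
        end
    end
    return matches
end

r1 = count_unique_matches([1, 2, 3, 4, 5], [3, 4, 5, 6])
r2 = count_unique_matches([1, 2, 2, 3, 3], [2, 2, 3, 4])
```

With duplicates present, the result should agree with `length(intersect(Set(x), Set(y)))`.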
A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers in length, a simple O(N^2) count wins over the count_matches sorting method and the intersect count_matches0 methods above, because of zero allocation setup time:
function count_matches2(x, y)
count(n -> any(==(n), x), y)
end
@btime count_matches(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches2(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)
The simplicity advantage vanishes with arrays of size > 1000.

Skip every nth element of array

How can I remove every nth element from an array in julia? Let's say I have the following array: a = [1 2 3 4 5 6] and I want b = [1 2 4 5]
In javascript I would do something like:
b = a.filter(e => e % 3);
How can it be done in Julia?
Your question title and text ask different questions. The title asks how to skip the Nth element, whereas the Javascript code snippet details how to skip elements based on their value, not their index.
Skipping by Value
We can do this using filter.
filter((x) -> x % 3 != 0, a)
This is basically equivalent to your Javascript code. We can, incidentally, also use broadcasting.
a[a .% 3 .!= 0]
This is more akin to code you would see in array-oriented languages like MATLAB and R.
Skipping by Index
With an extra enumerate call, we can get the indices to operate on.
map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate(a)))
This is roughly what you'd do in Python. enumerate to get the indices, filter to purge, then map to eliminate the now-unnecessary indices.
Or we can, again, use broadcasting.
a[(1:length(a)) .% 3 .!= 0]
If you need to skip by index, the most elegant way is to use InvertedIndices:
julia> using InvertedIndices # or using DataFrames
julia> a[Not(3:3:end)]
4-element Vector{Int64}:
1
2
4
5
As you can see, all you have to do is provide the range of indices you wish to skip.
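If adding a package is not desired, the same "give a range of indices to skip" idea can also be expressed with Base's `deleteat!` applied to a copy. This is a different technique from `Not`, shown only as a sketch, and it assumes `a` is a Vector (the 1×6 Matrix from the question would need `vec` first).

```julia
a = [1, 2, 3, 4, 5, 6]

# deleteat! removes the elements at the given sorted indices in place,
# so work on a copy to keep `a` intact.
b = deleteat!(copy(a), 3:3:length(a))
```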
If you want to filter by the index, one convenient way is using a comprehension:
julia> a = 10:10:100;
julia> [a[i] for i in eachindex(a) if i % 3 != 0] |> permutedims
1×7 Matrix{Int64}:
10 20 40 50 70 80 100
julia> vec(ans) == [a[1 + 3(j-1)÷2] for j in 1:7]
true
This implicitly involves Iterators.filter, and collects the generator. You can also use this to filter by value, although the eager filter is probably more efficient:
julia> a = 1:10;
julia> [x for x in a if x%3!=0] |> permutedims
1×7 Matrix{Int64}:
1 2 4 5 7 8 10
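To make the relationship between the three forms concrete, here is a small sketch comparing the comprehension with an explicit `Iterators.filter` and the eager `filter`. The helper name `keep` is hypothetical.

```julia
a = 1:10
keep(x) = x % 3 != 0   # hypothetical predicate: drop multiples of 3

via_comprehension = [x for x in a if keep(x)]            # filtered generator, collected
via_lazy_filter   = collect(Iterators.filter(keep, a))   # the lazy filter, spelled out
via_eager_filter  = filter(keep, a)                      # eager filter
```

All three produce the same vector; only how the result is built differs.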
Perhaps it's interesting to time all of these:
julia> using BenchmarkTools, InvertedIndices
julia> a = rand(1000); # filter by index
julia> i1 = @btime [$a[1 + 3(j-1)÷2] for j in 1:667];
373.162 ns (1 allocation: 5.38 KiB)
julia> i2 = @btime $a[eachindex($a) .% 3 .!= 0];
1.387 μs (4 allocations: 9.80 KiB)
julia> i3 = @btime [$a[i] for i in eachindex($a) if i % 3 != 0];
3.557 μs (11 allocations: 16.47 KiB)
julia> i4 = @btime map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate($a)));
4.202 μs (11 allocations: 16.47 KiB)
julia> i5 = @btime $a[Not(3:3:end)];
84.333 μs (4655 allocations: 182.28 KiB)
julia> i1 == i2 == i3 == i4 == i5
true
julia> a = rand(1:99, 1000); # filter by value
julia> v1 = @btime filter(x -> x%3!=0, $a);
532.185 ns (1 allocation: 7.94 KiB)
julia> v2 = @btime [x for x in $a if x%3!=0];
5.465 μs (11 allocations: 16.47 KiB)
julia> v1 == v2
true
This should help you:
b = a[Bool[i %3 != 0 for i = 1:length(a)]]
a[a .% 2 .!= 0]
please find the link with code.

Julia: A fast and elegant way to get a matrix from an array of arrays

There is an array of arrays containing more than 10,000 pairs of Float64 values. Something like this:
v = [[rand(),rand()], ..., [rand(),rand()]]
I want to get a matrix with two columns from it. It is possible to iterate over all pairs with a loop; this looks cumbersome, but gives the result in a fraction of a second:
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
w = hcat(x,y)
The solution with permutedims(reshape(hcat(v...), (length(v[1]), length(v)))), which I found in this task, looks more elegant but hangs Julia completely, so that the session has to be restarted. Perhaps it was optimal six years ago, but it no longer works for large arrays. Is there a solution that is both compact and fast?
I hope this is short and efficient enough for you:
getindex.(v, [1 2])
and if you want something simpler to digest:
[v[i][j] for i in 1:length(v), j in 1:2]
Also the hcat solution could be written as:
permutedims(reshape(reduce(hcat, v), (length(v[1]), length(v))));
and it should not hang your Julia (please confirm - it works for me).
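For a small input, the `reduce(hcat, v)` route can be traced step by step. The sketch below uses a three-element `v` as a stand-in for the real data; as far as I know, `hcat(v...)` passes every inner vector as a separate argument, whereas `reduce(hcat, v)` hits a method in Base specialized for vectors of arrays.

```julia
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # small stand-in for the real data

wide = reduce(hcat, v)     # 2×3 matrix: each inner vector becomes a column
tall = permutedims(wide)   # 3×2 matrix: one row per original pair
```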
@Antonello: to understand why this works consider a simpler example:
julia> string.(["a", "b", "c"], [1 2])
3×2 Matrix{String}:
"a1" "a2"
"b1" "b2"
"c1" "c2"
I am broadcasting a column Vector ["a", "b", "c"] and a 1-row Matrix [1 2]. The point is that [1 2] is a Matrix, so broadcasting expands along both rows (forced by the Vector) and columns (forced by the Matrix). For this expansion to happen it is crucial that the [1 2] matrix has exactly one row. Is this clearer now?
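The same shape reasoning applied to `getindex` can be checked on a tiny input. This is only an illustrative sketch; `v` and `cols` here are small stand-ins, not the data from the question.

```julia
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # column Vector of length 3
cols = [1 2]                               # 1×2 Matrix of inner indices

# Broadcasting a (3,) shape against a (1, 2) shape gives a 3×2 result,
# where M[i, j] == getindex(v[i], cols[1, j]).
M = getindex.(v, cols)
```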
Your own example is pretty close to a good solution, but does some unnecessary work by creating two distinct vectors and repeatedly using push!. This solution is similar, but simpler. It is not as terse as the broadcasted getindex by @BogumilKaminski, but is faster:
function mat(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for i in eachindex(v)
M[i, 1] = v[i][1]
M[i, 2] = v[i][2]
end
return M
end
You can simplify it a bit further, without losing performance, like this:
function mat_simpler(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for (i, x) in pairs(v)
M[i, 1], M[i, 2] = x
end
return M
end
A benchmark of the various solutions posted so far...
using BenchmarkTools
# Creating the vector
v = [[i, i+0.1] for i in 0.1:0.2:2000]
M1 = @btime vcat([[e[1] e[2]] for e in $v]...)
M2 = @btime getindex.($v, [1 2])
M3 = @btime [v[i][j] for i in 1:length($v), j in 1:2]
M4 = @btime permutedims(reshape(reduce(hcat, $v), (length($v[1]), length($v))))
M5 = @btime permutedims(reshape(hcat($v...), (length($v[1]), length($v))))
function original(v)
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
return hcat(x,y)
end
function mat(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for i in eachindex(v)
M[i, 1] = v[i][1]
M[i, 2] = v[i][2]
end
return M
end
function mat_simpler(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for (i, x) in pairs(v)
M[i, 1], M[i, 2] = x
end
return M
end
M6 = @btime original($v)
M7 = @btime mat($v)
M8 = @btime mat_simpler($v)
M1 == M2 == M3 == M4 == M5 == M6 == M7 == M8 # true
Output:
1.126 ms (10010 allocations: 1.53 MiB) # M1
54.161 μs (3 allocations: 156.42 KiB) # M2
809.000 μs (38983 allocations: 765.50 KiB) # M3
98.935 μs (4 allocations: 312.66 KiB) # M4
244.696 μs (10 allocations: 469.23 KiB) # M5
219.907 μs (30 allocations: 669.61 KiB) # M6
34.311 μs (2 allocations: 156.33 KiB) # M7
34.395 μs (2 allocations: 156.33 KiB) # M8
Note that the dollar sign in the benchmarked code is just to force @btime to treat the vector as a local variable.

When is `.=` more efficient than `=`?

Consider the following REPL lines using BenchmarkTools:
julia> N = 10^2; M = collect(reshape(1:N^2,N,N)); e = collect(1:N); # N=100
julia> @btime M[:,1] .= e; @btime M[:,1] = e;
1.211 μs (6 allocations: 128 bytes)
364.623 ns (1 allocation: 16 bytes)
julia> N = 10^3; M = collect(reshape(1:N^2,N,N)); e = collect(1:N); # N=1000
julia> @btime M[:,1] .= e; @btime M[:,1] = e;
1.511 μs (6 allocations: 128 bytes)
1.634 μs (1 allocation: 16 bytes)
julia> N = 10^4; M = collect(reshape(1:N^2,N,N)); e = collect(1:N); # N=10000
julia> @btime M[:,1] .= e; @btime M[:,1] = e;
3.514 μs (6 allocations: 128 bytes)
13.230 μs (1 allocation: 16 bytes)
It seems that .= is more efficient than =, but only for large N. I still do not understand very well what's happening under the hood and do not find explanations in the Julia documentation. When should I use one or the other?
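As a starting point for reasoning about the two forms, the sketch below only shows what each assignment does, not which is faster. On the left-hand side, `M[:,1] = e` is a call to `setindex!`, while `M[:,1] .= e` broadcasts into a view of the column; the explicit `copyto!` line is added here purely for comparison.

```julia
M = collect(reshape(1:16, 4, 4))
e = collect(101:104)

M1 = copy(M); M1[:, 1] = e                  # setindex!(M1, e, :, 1)
M2 = copy(M); M2[:, 1] .= e                 # broadcast into a view of column 1
M3 = copy(M); copyto!(view(M3, :, 1), e)    # explicit in-place copy into the view
```

All three write the column in place and give the same matrix, so any timing difference comes from the code path taken, not from the result.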

Optimisation of 4D tensor rotation

I have to perform the rotation of a 3x3x3x3 4D tensor +100k times per time step in a Stokes solver, where the rotated 4D tensor is Crot[i,j,k,l] = Crot[i,j,k,l] + Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p], with all indexes from 1 to 3.
So far I have naively written the following code in Julia:
Q = rand(3,3)
C = rand(3,3,3,3)
Crot = Array{Float64}(undef,3,3,3,3)
function rotation_4d!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
aux = 0.0
for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
for m = 1:3
for n = 1:3
for o = 1:3
for p = 1:3
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
end
end
end
Crot[i,j,k,l] += aux
end
end
end
end
end
With:
@btime rotation_4d!(Crot,Q,C)
14.255 μs (0 allocations: 0 bytes)
Is there any way to optimise the code?
I timed the various einsum packages. Einsum is faster just by virtue of adding @inbounds. TensorOperations is slower for such small matrices. LoopVectorization takes an age to compile here, but the end result is faster.
(I presume you meant to zero aux once per element, for l = 1:3; aux = 0.0; for m = 1:3, and I set Crot .= 0 so as not to accumulate on top of junk.)
@btime rotation_4d!($Crot,$Q,$C) # 14.556 μs (0 allocations: 0 bytes)
Crot .= 0; # surely!
rotation_4d!(Crot,Q,C)
res = copy(Crot);
using Einsum # just adds @inbounds really
rot_ei!(Crot,Q,C) = @einsum Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
Crot .= 0;
rot_ei!(Crot,Q,C) ≈ res # true
@btime rot_ei!($Crot,$Q,$C); # 7.445 μs (0 allocations: 0 bytes)
using TensorOperations # sends to BLAS
rot_to!(Crot,Q,C) = @tensor Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
Crot .= 0;
rot_to!(Crot,Q,C) ≈ res # true
@btime rot_to!($Crot,$Q,$C); # 22.810 μs (106 allocations: 11.16 KiB)
using Tullio, LoopVectorization
rot_lv!(Crot,Q,C) = @tullio Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p] tensor=false
Crot .= 0;
@time rot_lv!(Crot,Q,C) ≈ res # 50 seconds!
@btime rot_lv!($Crot,$Q,$C); # 2.662 μs (8 allocations: 256 bytes)
However, this is still an awful algorithm. It's just 4 small matrix multiplications, but each one gets done many times. Doing them in series is much faster -- 9*4 * 27 multiplications, instead of [corrected!] 4 * 9^4 for the simple nesting above.
function rot2_ein!(Crot, Q, C)
@einsum mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
@einsum Crot[i,j,k,l] += Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
Crot .= 0; rot2_ein!(Crot,Q,C) ≈ res # true
@btime rot2_ein!($Crot, $Q, $C); # 1.585 μs (2 allocations: 784 bytes)
function rot4_ein!(Crot, Q, C) # overwrites Crot without addition
@einsum Crot[m,n,o,l] = Q[p,l] * C[m,n,o,p]
@einsum Crot[m,n,k,l] = Q[o,k] * Crot[m,n,o,l]
@einsum Crot[m,j,k,l] = Q[n,j] * Crot[m,n,k,l]
@einsum Crot[i,j,k,l] = Q[m,i] * Crot[m,j,k,l]
end
rot4_ein!(Crot,Q,C) ≈ res # true
@btime rot4_ein!($Crot, $Q, $C); # 1.006 μs
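The idea of contracting one index at a time can also be written in plain Julia, without any einsum package. The following is a sketch under the assumption of 3×3 and 3×3×3×3 inputs; `rotate_naive` and `rotate_stepwise` are hypothetical names, each step uses its own temporary array, and no timing is claimed for it.

```julia
# Reference: all eight loops nested, accumulator reset per output element.
function rotate_naive(Q, C)
    Crot = zeros(3, 3, 3, 3)
    for i in 1:3, j in 1:3, k in 1:3, l in 1:3
        acc = 0.0
        for m in 1:3, n in 1:3, o in 1:3, p in 1:3
            acc += Q[m, i] * Q[n, j] * Q[o, k] * Q[p, l] * C[m, n, o, p]
        end
        Crot[i, j, k, l] = acc
    end
    return Crot
end

# Stepwise: contract one index of C with Q per pass (four passes in total).
function rotate_stepwise(Q, C)
    T1 = zeros(3, 3, 3, 3)
    for m in 1:3, n in 1:3, o in 1:3, l in 1:3, p in 1:3
        T1[m, n, o, l] += Q[p, l] * C[m, n, o, p]      # sum over p
    end
    T2 = zeros(3, 3, 3, 3)
    for m in 1:3, n in 1:3, k in 1:3, l in 1:3, o in 1:3
        T2[m, n, k, l] += Q[o, k] * T1[m, n, o, l]     # sum over o
    end
    T3 = zeros(3, 3, 3, 3)
    for m in 1:3, j in 1:3, k in 1:3, l in 1:3, n in 1:3
        T3[m, j, k, l] += Q[n, j] * T2[m, n, k, l]     # sum over n
    end
    Crot = zeros(3, 3, 3, 3)
    for i in 1:3, j in 1:3, k in 1:3, l in 1:3, m in 1:3
        Crot[i, j, k, l] += Q[m, i] * T3[m, j, k, l]   # sum over m
    end
    return Crot
end

Q = reshape(collect(1.0:9.0), 3, 3) ./ 10
C = reshape(collect(1.0:81.0), 3, 3, 3, 3) ./ 81
same = rotate_stepwise(Q, C) ≈ rotate_naive(Q, C)
```

Each pass does 81 × 3 multiplications, which matches the 9*4 * 27 count mentioned above.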
You're doing a lot of indexing here, and therefore a lot of bounds checking. One way to shave off some time here is to use the @inbounds macro, which turns bounds checking off. Rewriting your code as:
function rotation_4d!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
aux = 0.0
@inbounds for i = 1:3, j = 1:3, k = 1:3, l = 1:3
for m = 1:3, n = 1:3, o = 1:3, p = 1:3
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
Crot[i,j,k,l] += aux
end
end
gives me a roughly 3x speedup (6μs vs 18μs on my system).
You can read about this in the manual here. Note however that you need to make sure that all your dimensions are correctly sized, which makes working with hardcoded ranges like in your function tricky - consider using some of Julia's builtin iteration syntax (like eachindex) or using size(Q, 1) if you need your loops to change iterations numbers depending on inputs.
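One way to follow that advice is to derive the loop range from the inputs and validate all sizes once, before bounds checking is switched off. The sketch below assumes ordinary 1-based `Array` inputs; `rotation_checked!` is a hypothetical name, and unlike the code above it resets `aux` for every output element.

```julia
function rotation_checked!(Crot, Q, C)
    d = size(Q, 1)
    # Validate every dimension once, while bounds checks are still active.
    size(Q) == (d, d) || throw(DimensionMismatch("Q must be square"))
    (size(C) == (d, d, d, d) && size(Crot) == (d, d, d, d)) ||
        throw(DimensionMismatch("C and Crot must be d×d×d×d"))
    @inbounds for i in 1:d, j in 1:d, k in 1:d, l in 1:d
        aux = zero(eltype(Crot))   # reset the accumulator per output element
        for m in 1:d, n in 1:d, o in 1:d, p in 1:d
            aux += Q[m, i] * Q[n, j] * Q[o, k] * Q[p, l] * C[m, n, o, p]
        end
        Crot[i, j, k, l] += aux
    end
    return Crot
end

# With Q equal to the identity, the rotation should leave C unchanged.
Qid = [1.0 0.0; 0.0 1.0]
C2 = reshape(collect(1.0:16.0), 2, 2, 2, 2)
out = rotation_checked!(zeros(2, 2, 2, 2), Qid, C2)
```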
That seems to be a proper contraction (every index occurring either in the output, or exactly twice on the right-hand side), and thus can be done with TensorOperations.jl:
@tensor Crot[i,j,k,l] = Crot[i,j,k,l] + Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
Or OMEinsum.jl.
It might also pay off to use StaticArrays.jl, since your tensor is small and of constant size. I don't know whether it works with any Einstein summation packages, but in any case you would be able to generate a completely unrolled function for the contraction.
(Note: I didn't actually test either of them for this case. If it is not a proper contraction, TensorOperations will complain at (I think) compile time.)
