Optimisation of 4D tensor rotation - arrays

I have to rotate a 3x3x3x3 4D tensor more than 100k times per time step in a Stokes solver, where the rotated 4D tensor is Crot[i,j,k,l] = Crot[i,j,k,l] + Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p], with all indices running from 1 to 3.
So far I have naively written the following code in Julia:
Q = rand(3,3)
C = rand(3,3,3,3)
Crot = Array{Float64}(undef,3,3,3,3)
function rotation_4d!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
    aux = 0.0
    for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    for m = 1:3
                        for n = 1:3
                            for o = 1:3
                                for p = 1:3
                                    aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                end
                            end
                        end
                    end
                    Crot[i,j,k,l] += aux
                end
            end
        end
    end
end
With:
@btime rotation_4d!(Crot,Q,C)
14.255 μs (0 allocations: 0 bytes)
Is there any way to optimise the code?

I timed the various einsum packages. Einsum is faster just by virtue of adding @inbounds. TensorOperations is slower for such small matrices. LoopVectorization takes an age to compile here, but the end result is faster.
(I presume you meant to zero aux once per element, for l = 1:3; aux = 0.0; for m = 1:3, and I set Crot .= 0 so as not to accumulate on top of junk.)
@btime rotation_4d!($Crot,$Q,$C) # 14.556 μs (0 allocations: 0 bytes)
Crot .= 0; # surely!
rotation_4d!(Crot,Q,C)
res = copy(Crot);
using Einsum # just adds @inbounds really
rot_ei!(Crot,Q,C) = @einsum Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
Crot .= 0;
rot_ei!(Crot,Q,C) ≈ res # true
@btime rot_ei!($Crot,$Q,$C); # 7.445 μs (0 allocations: 0 bytes)
using TensorOperations # sends to BLAS
rot_to!(Crot,Q,C) = @tensor Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
Crot .= 0;
rot_to!(Crot,Q,C) ≈ res # true
@btime rot_to!($Crot,$Q,$C); # 22.810 μs (106 allocations: 11.16 KiB)
using Tullio, LoopVectorization
rot_lv!(Crot,Q,C) = @tullio Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p] tensor=false
Crot .= 0;
@time rot_lv!(Crot,Q,C) ≈ res # 50 seconds!
@btime rot_lv!($Crot,$Q,$C); # 2.662 μs (8 allocations: 256 bytes)
However, this is still an awful algorithm. The rotation is just four small matrix multiplications, but the simple nesting repeats each one many times. Doing them in series is much faster: 4 * 9 * 27 = 972 multiplications, instead of 4 * 9^4 = 26244 for the simple nesting above.
function rot2_ein!(Crot, Q, C)
    @einsum mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
    @einsum Crot[i,j,k,l] += Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
Crot .= 0; rot2_ein!(Crot,Q,C) ≈ res # true
@btime rot2_ein!($Crot, $Q, $C); # 1.585 μs (2 allocations: 784 bytes)
function rot4_ein!(Crot, Q, C) # overwrites Crot without addition
    @einsum Crot[m,n,o,l] = Q[p,l] * C[m,n,o,p]
    @einsum Crot[m,n,k,l] = Q[o,k] * Crot[m,n,o,l]
    @einsum Crot[m,j,k,l] = Q[n,j] * Crot[m,n,k,l]
    @einsum Crot[i,j,k,l] = Q[m,i] * Crot[m,j,k,l]
end
rot4_ein!(Crot,Q,C) ≈ res # true
@btime rot4_ein!($Crot, $Q, $C); # 1.006 μs
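The multiplication counts quoted above can be checked independently of Julia. The following is a minimal pure-Python sketch, not the code that was benchmarked: the helper names (naive_rotation, contract_axis, sequential_rotation) and the dict-of-tuples tensor layout are assumptions made for illustration. It counts multiplications for the eight-loop version and for four successive single-index contractions, and confirms that both produce the same tensor.

```python
import itertools
import random

R = range(3)  # every index runs over three values (0-based here)

def naive_rotation(Q, C):
    """Direct eight-index sum. Returns (Crot, number of multiplications)."""
    Crot, mults = {}, 0
    for i, j, k, l in itertools.product(R, repeat=4):
        acc = 0.0  # reset once per output element
        for m, n, o, p in itertools.product(R, repeat=4):
            acc += Q[m][i] * Q[n][j] * Q[o][k] * Q[p][l] * C[m, n, o, p]
            mults += 4  # four multiplications per term
        Crot[i, j, k, l] = acc
    return Crot, mults

def contract_axis(Q, T, axis):
    """Contract one index of T with Q: out[.., a, ..] = sum_b Q[b][a] * T[.., b, ..]."""
    out, mults = {}, 0
    for idx in itertools.product(R, repeat=4):
        acc = 0.0
        for b in R:
            src = idx[:axis] + (b,) + idx[axis + 1:]
            acc += Q[b][idx[axis]] * T[src]
            mults += 1
        out[idx] = acc
    return out, mults

def sequential_rotation(Q, C):
    """Apply the four single-index contractions one after another."""
    T, total = C, 0
    for axis in range(4):
        T, mults = contract_axis(Q, T, axis)
        total += mults
    return T, total

random.seed(0)
Q = [[random.random() for _ in R] for _ in R]
C = {idx: random.random() for idx in itertools.product(R, repeat=4)}

naive, naive_mults = naive_rotation(Q, C)
seq, seq_mults = sequential_rotation(Q, C)
print(naive_mults, seq_mults)  # 26244 972
```

Each single-index contraction touches 81 output elements with 3 multiplications each, which is where the factor 9 * 27 = 243 per step comes from.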

You're doing a lot of indexing here, and therefore a lot of bounds checking. One way to shave off some time here is to use the @inbounds macro, which turns bounds checking off. Rewriting your code as:
function rotation_4d!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
    aux = 0.0
    @inbounds for i = 1:3, j = 1:3, k = 1:3, l = 1:3
        for m = 1:3, n = 1:3, o = 1:3, p = 1:3
            aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
        end
        Crot[i,j,k,l] += aux
    end
end
gives me a roughly 3x speedup (6μs vs 18μs on my system).
You can read about this in the manual here. Note, however, that you need to make sure that all your dimensions are correctly sized, which makes hardcoded ranges like those in your function risky. Consider using Julia's built-in iteration syntax (like eachindex), or size(Q, 1) if the number of loop iterations should depend on the inputs.

That seems to be a proper contraction (every index occurring either in the output, or exactly twice on the right-hand side), and thus can be done with TensorOperations.jl:
@tensor Crot[i,j,k,l] = Crot[i,j,k,l] + Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
Or OMEinsum.jl.
It might also pay off to use StaticArrays.jl, since your tensor is small and of constant size. I don't know whether it works with any Einstein summation packages, but in any case you would be able to generate a completely unrolled function for the contraction.
(Note: I didn't actually test either of them for this case. If it is not a proper contraction, TensorOperations will complain at (I think) compile time.)

Related

How to count matches in two arrays?

If I have two arrays, how can I count the number of matching elements?
E.g. with
x = [1,2,3,4,5]
y = [3,4,5,6]
I'd like to get the count (3) of the three matching elements 3, 4, and 5.
You can use intersect:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
3
4
5
6
julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
5
4
3
julia> length(intersect(Set(x), Set(y)))
3
The following algorithm can be nearly 4x faster than Set intersection. The idea is to sort the arrays first, which has O(n log n) complexity for each array, and then merge-compare the sorted versions for equal elements, which has linear O(m + n) complexity. The overall complexity is therefore O(n log n).
This algorithm counts duplicate elements into the final matches result, but can be modified with a small overhead to behave similarly to sets. The modification can include adding a variable to keep track of the last matched elements and increment the number of matches only for new different matched pairs.
function count_matches(x,y)
    sort!(x) # or x = sort(x)
    sort!(y) # or y = sort(y)
    i = j = 1
    matches = 0
    while i <= length(x) && j <= length(y)
        if x[i] == y[j]
            i += 1
            j += 1
            matches += 1
        elseif x[i] < y[j]
            i += 1
        else
            j += 1
        end
    end
    matches
end
Comparing with:
function count_matches0(x,y)
length(intersect(Set(x), Set(y)))
end
and timing with n = 10000 arrays, we get:
@btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
@btime count_matches(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
246.700 μs (31 allocations: 338.31 KiB)
63.200 μs (2 allocations: 15.88 KiB)
A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers in length, a simple O(N^2) count wins over the count_matches sorting method and the intersect count_matches0 methods above, because of zero allocation setup time:
function count_matches2(x, y)
    count(n -> any(==(n), x), y)
end
@btime count_matches(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches2(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)
The simplicity advantage vanishes with arrays of size > 1000.
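The set-like variant mentioned earlier (counting each matched value only once) can be sketched as follows. This is an illustrative Python translation rather than the benchmarked Julia code; the name count_distinct_matches and the use of sorted copies (so the inputs are not modified) are assumptions of this sketch.

```python
def count_distinct_matches(x, y):
    """Count values present in both x and y, each value counted once."""
    xs, ys = sorted(x), sorted(y)  # sorted copies; inputs stay untouched
    i = j = matches = 0
    last = None  # last value that was counted as a match
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            # only count a value the first time it is matched
            if matches == 0 or xs[i] != last:
                matches += 1
                last = xs[i]
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return matches

print(count_distinct_matches([1, 2, 3, 4, 5], [3, 4, 5, 6]))  # 3
print(count_distinct_matches([1, 1, 2, 2], [1, 1, 2]))        # 2
```

The extra comparison against the last matched value is the small overhead referred to above; the result then agrees with the length of the Set intersection.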

Performance assigning and copying with StaticArrays.jl in Julia

I was thinking of using the package StaticArrays.jl to enhance the performance of my code. However, I only use arrays to store computed variables and use them later, after certain conditions are met. Hence, I benchmarked the type SizedVector against a normal Vector, but I do not understand the results of the code below. I also tried a StaticVector, using Setfield.jl as a workaround.
using StaticArrays, BenchmarkTools, Setfield
function copySized(n::Int64)
    v = SizedVector{n, Int64}(zeros(n))
    w = Vector{Int64}(undef, n)
    for i in eachindex(v)
        v[i] = i
    end
    for i in eachindex(v)
        w[i] = v[i]
    end
end
function copyStatic(n::Int64)
    v = @SVector zeros(n)
    w = Vector{Int64}(undef, n)
    for i in eachindex(v)
        @set v[i] = i
    end
    for i in eachindex(v)
        w[i] = v[i]
    end
end
function copynormal(n::Int64)
    v = zeros(n)
    w = Vector{Int64}(undef, n)
    for i in eachindex(v)
        v[i] = i
    end
    for i in eachindex(v)
        w[i] = v[i]
    end
end
n = 10
@btime copySized($n)
@btime copyStatic($n)
@btime copynormal($n)
3.950 μs (42 allocations: 2.08 KiB)
5.417 μs (98 allocations: 4.64 KiB)
78.822 ns (2 allocations: 288 bytes)
Why does the SizedVector case have so many more allocations and hence worse performance? Am I not using SizedVector correctly? Should it not at least match the performance of normal arrays?
Thank you in advance.
Cross post of Julia Discourse
I feel this is an apples-to-oranges comparison (and the size should be stored statically in the type). More illustrative code could look like this:
function copySized(::Val{n}) where n
    v = SizedVector{n}(1:n)
    w = Vector{Int64}(undef, n)
    w .= v
end
function copyStatic(::Val{n}) where n
    v = SVector{n}(1:n)
    w = Vector{Int64}(undef, n)
    w .= v
end
function copynormal(n)
    v = [1:n;]
    w = Vector{Int64}(undef, n)
    w .= v
end
And now benchmarks:
julia> n = 10
10
julia> @btime copySized(Val{$n}());
248.138 ns (1 allocation: 144 bytes)
julia> @btime copyStatic(Val{$n}());
251.507 ns (1 allocation: 144 bytes)
julia> @btime copynormal($n);
77.940 ns (2 allocations: 288 bytes)
julia> n = 1000
1000
julia> @btime copySized(Val{$n}());
840.000 ns (2 allocations: 7.95 KiB)
julia> @btime copyStatic(Val{$n}());
830.769 ns (2 allocations: 7.95 KiB)
julia> @btime copynormal($n);
1.100 μs (2 allocations: 15.88 KiB)
@phipsgabler is right! Statically sized arrays have their performance advantages when the size is known statically, at compile time. My arrays are, however, dynamically sized, with the size n being a runtime variable.
Changing this yields more sensible results:
using StaticArrays, BenchmarkTools, Setfield
function copySized()
    v = SizedVector{10, Float64}(zeros(10))
    w = Vector{Float64}(undef, 10*2)
    for i in eachindex(v)
        v[i] = rand()
    end
    for i in eachindex(v)
        j = i+floor(Int64, 10/4)
        w[j] = v[i]
    end
end
function copyStatic()
    v = @SVector zeros(10)
    w = Vector{Int64}(undef, 10*2)
    for i in eachindex(v)
        @set v[i] = rand()
    end
    for i in eachindex(v)
        j = i+floor(Int64, 10/4)
        w[j] = v[i]
    end
end
function copynormal()
    v = zeros(10)
    w = Vector{Float64}(undef, 10*2)
    for i in eachindex(v)
        v[i] = rand()
    end
    for i in eachindex(v)
        j = i+floor(Int64, 10/4)
        w[j] = v[i]
    end
end
@btime copySized()
@btime copyStatic()
@btime copynormal()
110.162 ns (3 allocations: 512 bytes)
48.133 ns (1 allocation: 224 bytes)
92.045 ns (2 allocations: 368 bytes)

Julia: A fast and elegant way to get a matrix from an array of arrays

There is an array of arrays containing more than 10,000 pairs of Float64 values. Something like this:
v = [[rand(),rand()], ..., [rand(),rand()]]
I want to get a matrix with two columns from it. It is possible to loop over all pairs; this looks cumbersome, but gives the result in a fraction of a second:
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
w = hcat(x,y)
The solution with permutedims(reshape(hcat(v...), (length(v[1]), length(v)))), which I found in this task, looks more elegant but hangs Julia completely, and the session has to be restarted. Perhaps it was optimal six years ago, but it does not work now for large arrays. Is there a solution that is both compact and fast?
I hope this is short and efficient enough for you:
getindex.(v, [1 2])
and if you want something simpler to digest:
[v[i][j] for i in 1:length(v), j in 1:2]
Also the hcat solution could be written as:
permutedims(reshape(reduce(hcat, v), (length(v[1]), length(v))));
and it should not hang your Julia (please confirm - it works for me).
@Antonello: to understand why this works consider a simpler example:
julia> string.(["a", "b", "c"], [1 2])
3×2 Matrix{String}:
"a1" "a2"
"b1" "b2"
"c1" "c2"
I am broadcasting a column Vector ["a", "b", "c"] and a 1-row Matrix [1 2]. The point is that [1 2] is a Matrix, so broadcasting expands both rows (forced by the Vector) and columns (forced by the Matrix). For this expansion to happen it is crucial that the [1 2] matrix has exactly one row. Is this clearer now?
Your own example is pretty close to a good solution, but does some unnecessary work, by creating two distinct vectors, and repeatedly using push!. This solution is similar, but simpler. It is not as terse as the broadcasted getindex by @BogumilKaminski, but is faster:
function mat(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for i in eachindex(v)
        M[i, 1] = v[i][1]
        M[i, 2] = v[i][2]
    end
    return M
end
You can simplify it a bit further, without losing performance, like this:
function mat_simpler(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for (i, x) in pairs(v)
        M[i, 1], M[i, 2] = x
    end
    return M
end
A benchmark of the various solutions posted so far...
using BenchmarkTools
# Creating the vector
v = [[i, i+0.1] for i in 0.1:0.2:2000]
M1 = @btime vcat([[e[1] e[2]] for e in $v]...)
M2 = @btime getindex.($v, [1 2])
M3 = @btime [v[i][j] for i in 1:length($v), j in 1:2]
M4 = @btime permutedims(reshape(reduce(hcat, $v), (length($v[1]), length($v))))
M5 = @btime permutedims(reshape(hcat($v...), (length($v[1]), length($v))))
function original(v)
    x = Vector{Float64}()
    y = Vector{Float64}()
    for i = 1:length(v)
        push!(x, v[i][1])
        push!(y, v[i][2])
    end
    return hcat(x,y)
end
function mat(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for i in eachindex(v)
        M[i, 1] = v[i][1]
        M[i, 2] = v[i][2]
    end
    return M
end
function mat_simpler(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for (i, x) in pairs(v)
        M[i, 1], M[i, 2] = x
    end
    return M
end
M6 = @btime original($v)
M7 = @btime mat($v)
M8 = @btime mat_simpler($v)
M1 == M2 == M3 == M4 == M5 == M6 == M7 == M8 # true
Output:
1.126 ms (10010 allocations: 1.53 MiB) # M1
54.161 μs (3 allocations: 156.42 KiB) # M2
809.000 μs (38983 allocations: 765.50 KiB) # M3
98.935 μs (4 allocations: 312.66 KiB) # M4
244.696 μs (10 allocations: 469.23 KiB) # M5
219.907 μs (30 allocations: 669.61 KiB) # M6
34.311 μs (2 allocations: 156.33 KiB) # M7
34.395 μs (2 allocations: 156.33 KiB) # M8
Note that the dollar sign in the benchmarked code is just to force @btime to consider the vector as a local variable.

Total numbers having frequency k in a given range

How can I count the numbers that have frequency k within a particular range (l, r) of a given array? There are 10^5 queries in total, each of the form l, r, and each query is built on the previous query's answer. In particular, after each query we increment l by the result of the query, swapping l and r if l > r. Note that 0 <= a[i] <= 10^9. The array has n = 10^5 elements.
My Attempt:
n,k,q = map(int,input().split())
a = list(map(int,input().split()))
ans = 0
for _ in range(q):
    l,r = map(int,input().split())
    l+=ans
    l%=n
    r+=ans
    r%=n
    if l>r:
        l,r = r,l
    d = {}
    for i in a[l:r+1]:
        try:
            d[i]+=1
        except:
            d[i] = 1
    curr_ans = 0
    for i in d.keys():
        if d[i]==k:
            curr_ans+=1
    ans = curr_ans
    print(ans)
Sample Input:
5 2 3
7 6 6 5 5
0 4
3 0
4 1
Sample Output:
2
1
1
If the number of different values in the array is not too large, you may consider storing arrays as long as the input array, one per unique value, counting the number of appearances of the value until each point. Then you just need to subtract the end values from the beginning values to find how many frequency matches are there:
def range_freq_queries(seq, k, queries):
    n = len(seq)
    c = freq_counts(seq)
    result = [0] * len(queries)
    offset = 0
    for i, (l, r) in enumerate(queries):
        result[i] = range_freq_matches(c, offset, l, r, k, n)
        offset = result[i]
    return result
def freq_counts(seq):
    s = {v: i for i, v in enumerate(set(seq))}
    counts = [None] * (len(seq) + 1)
    counts[0] = [0] * len(s)
    for i, v in enumerate(seq, 1):
        counts[i] = list(counts[i - 1])
        j = s[v]
        counts[i][j] += 1
    return counts
def range_freq_matches(counts, offset, start, end, k, n):
    start, end = sorted(((start + offset) % n, (end + offset) % n))
    return sum(1 for cs, ce in zip(counts[start], counts[end + 1]) if ce - cs == k)
seq = [7, 6, 6, 5, 5]
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries(seq, k, queries))
# [2, 1, 1]
You can do it faster with NumPy, too. Since each result depends on the previous one, you will have to loop in any case, but you can use Numba to speed things up considerably:
import numpy as np
import numba as nb
def range_freq_queries_np(seq, k, queries):
    seq = np.asarray(seq)
    c = freq_counts_np(seq)
    return _range_freq_queries_np_nb(seq, k, queries, c)
@nb.njit # This is not necessary but will make things faster
def _range_freq_queries_np_nb(seq, k, queries, c):
    n = len(seq)
    offset = np.int32(0)
    out = np.empty(len(queries), dtype=np.int32)
    for i, (l, r) in enumerate(queries):
        l = (l + offset) % n
        r = (r + offset) % n
        l, r = min(l, r), max(l, r)
        out[i] = np.sum(c[r + 1] - c[l] == k)
        offset = out[i]
    return out
def freq_counts_np(seq):
    uniq = np.unique(seq)
    seq_pad = np.concatenate([[uniq.max() + 1], seq])
    comp = seq_pad[:, np.newaxis] == uniq
    return np.cumsum(comp, axis=0)
seq = np.array([7, 6, 6, 5, 5])
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries_np(seq, k, queries))
# [2 1 1]
Let's compare it with the original algorithm:
from collections import Counter
def range_freq_queries_orig(seq, k, queries):
    n = len(seq)
    ans = 0
    counter = Counter()
    out = [0] * len(queries)
    for i, (l, r) in enumerate(queries):
        l += ans
        l %= n
        r += ans
        r %= n
        if l > r:
            l, r = r, l
        counter.clear()
        counter.update(seq[l:r+1])
        ans = sum(1 for v in counter.values() if v == k)
        out[i] = ans
    return out
Here is a quick test and timing:
import random
import numpy as np
# Make random input
random.seed(0)
seq = random.choices(range(1000), k=5000)
queries = [(random.choice(range(len(seq))), random.choice(range(len(seq))))
           for _ in range(20000)]
k = 20
# Input as array for NumPy version
seq_arr = np.asarray(seq)
# Check all functions return the same result
res1 = range_freq_queries_orig(seq, k, queries)
res2 = range_freq_queries(seq, k, queries)
print(all(r1 == r2 for r1, r2 in zip(res1, res2)))
# True
res3 = range_freq_queries_np(seq_arr, k, queries)
print(all(r1 == r3 for r1, r3 in zip(res1, res3)))
# True
# Timings
%timeit range_freq_queries_orig(seq, k, queries)
# 3.07 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries(seq, k, queries)
# 1.1 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries_np(seq_arr, k, queries)
# 265 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Obviously the effectiveness of this depends on the characteristics of the data. In particular, if there are fewer repeated values, the time and memory cost of constructing the counts table will approach O(n^2).
Let's say the input array is A, |A|=n. I'm going to assume that the number of distinct elements in A is much smaller than n.
We can divide A into sqrt(n) segments each of size sqrt(n). For each of these segments, we can calculate a map from element to count. Building these maps takes O(n) time.
With that preprocessing done, we can answer each query by adding together all the maps wholly contained in (l,r), of which there are at most sqrt(n), then adding any extra elements (or going one segment over and subtracting), also sqrt(n).
If there are k distinct elements, this takes O(sqrt(n) * k) so in the worst case O(n) if in fact every element of A is distinct.
You can keep track of the elements that have the desired count while combining the hashes and extra elements.
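The segment idea described above can be sketched as follows. This is a minimal illustration under assumptions, not a tuned implementation: the function names (build_blocks, count_with_frequency, answer_queries) are hypothetical, leftover elements are simply added one by one rather than subtracted from a neighbouring segment, and the final count is done in a separate pass instead of being tracked while merging.

```python
import math
from collections import Counter

def build_blocks(a):
    """Split a into blocks of about sqrt(n) elements and count each block."""
    size = max(1, math.isqrt(len(a)))
    blocks = [Counter(a[s:s + size]) for s in range(0, len(a), size)]
    return size, blocks

def count_with_frequency(a, size, blocks, l, r, k):
    """Number of distinct values occurring exactly k times in a[l..r], inclusive."""
    total = Counter()
    i = l
    while i <= r:
        if i % size == 0 and i + size - 1 <= r:
            total.update(blocks[i // size])  # whole block lies inside the range
            i += size
        else:
            total[a[i]] += 1  # leftover element at either end
            i += 1
    return sum(1 for c in total.values() if c == k)

def answer_queries(a, k, queries):
    """Shift l and r by the previous answer modulo n, as in the question's code."""
    n = len(a)
    size, blocks = build_blocks(a)
    ans, out = 0, []
    for l, r in queries:
        l, r = sorted(((l + ans) % n, (r + ans) % n))
        ans = count_with_frequency(a, size, blocks, l, r, k)
        out.append(ans)
    return out

print(answer_queries([7, 6, 6, 5, 5], 2, [(0, 4), (3, 0), (4, 1)]))  # [2, 1, 1]
```

Merging a block costs time proportional to the number of distinct values in it, so the sketch only pays off when that number is small compared with the block size.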

IndexError: index 10 is out of bounds for axis 0 with size 10

I am numerically setting up a mesh (an x-grid and x-vector, plus a time grid) in order to solve a heat equation. The position x should only range between 0 and 20, and the time t from 0 to 1000. But whenever I set, for example, the number of steps to 10, I get an error:
"Traceback (most recent call last):
File "/home/universe/Desktop/Python/Heat_1.py", line 33, in <module>
x[i] = a + i*h
IndexError: index 10 is out of bounds for axis 0 with size 10"
Here is my code:
from math import sin,pi
import numpy
import numpy as np
#Constant variables
N = int(input("Number of intervals in x (<=20):"))
M = int(input("Number of time steps (<=1000):" ))
#Some initialised varibles
a = 0.0
b = 1.0
t_min = 0.0
t_max = 0.5
# Array Variables
x = np.linspace(a,b, M)
t = np.linspace(t_min, t_max, M)
#Some scalar variables
n = [] # the number of x-steps
i, s = [], [] # The position and time
# Get the number of x-steps to use
for n in range(0,N):
    if n > 0 or n <= N:
        continue
# Get the number of time steps to use
for m in range(0,M):
    if m > 0 or n <= M:
        continue
# Set up x-grid and x-vector
h =(b-a)/n
for i in range(0,N+1):
    x[i] = a + i*h
# Set up time-grid
k = (t_max - t_min)/m
for s in range(0, M+1):
    t[s] = t_min + k*s
print(x,t)
You try to index outside the range:
for s in range(0, M+1):
    t[s] = t_min + k*s
Change to:
for s in range(M):
    t[s] = t_min + k*s
And it works.
You create t with length of M:
t = np.linspace(t_min, t_max, M)
So you can only access M elements in t.
Python always starts indexing with zero. Therefore:
for s in range(M):
will do M loops, while:
for s in range(0, M+1):
will do M+1 loops.
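The same off-by-one applies to the x loop in the question, which is the line the traceback points at: x is created with M entries but filled with range(0, N+1). A grid with N intervals has N + 1 points, so an array filled with range(N + 1) needs N + 1 entries. A minimal sketch without NumPy, where make_grid is a hypothetical helper and not part of the question's code:

```python
def make_grid(start, stop, intervals):
    """Return intervals + 1 evenly spaced points from start to stop inclusive."""
    step = (stop - start) / intervals
    grid = [0.0] * (intervals + 1)   # allocate one more entry than intervals
    for i in range(intervals + 1):   # valid indices are 0 .. intervals
        grid[i] = start + i * step
    return grid

x = make_grid(0.0, 1.0, 10)
print(len(x))  # 11
```

Allocating the array and writing the loop from the same count keeps the two from drifting apart.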
