Julia: A fast and elegant way to get a matrix from an array of arrays - arrays

There is an array of arrays containing more than 10,000 pairs of Float64 values. Something like this:
v = [[rand(),rand()], ..., [rand(),rand()]]
I want to get a matrix with two columns from it. It is possible to bypass all pairs with a cycle, it looks cumbersome, but gives the result in a fraction of a second:
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
w = hcat(x,y)
The solution with permutedims(reshape(hcat(v...), (length(v[1]), length(v)))), which I found in this task, looks more elegant but completely suspends Julia, is needed to restart the session. Perhaps it was optimal six years ago, but now it is not working in the case of large arrays. Is there a solution that is both compact and fast?

I hope this is short and efficient enough for you:
getindex.(v, [1 2])
and if you want something simpler to digest:
[v[i][j] for i in 1:length(v), j in 1:2]
Also the hcat solution could be written as:
permutedims(reshape(reduce(hcat, v), (length(v[1]), length(v))));
and it should not hang your Julia (please confirm - it works for me).
#Antonello: to understand why this works consider a simpler example:
julia> string.(["a", "b", "c"], [1 2])
3×2 Matrix{String}:
"a1" "a2"
"b1" "b2"
"c1" "c2"
I am broadcasting a column Vector ["a", "b", "c"] and a 1-row Matrix [1 2]. The point is that [1 2] is a Matrix. Thus it makes broadcasting to expand both rows (forced by the vector) and columns (forced by a Matrix). For such expansion to happen it is crucial that the [1 2] matrix has exactly one row. Is this clearer now?

Your own example is pretty close to a good solution, but does some unnecessary work, by creating two distinct vectors, and repeatedly using push!. This solution is similar, but simpler. It is not as terse as the broadcasted getindex by #BogumilKaminski, but is faster:
function mat(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for i in eachindex(v)
M[i, 1] = v[i][1]
M[i, 2] = v[i][2]
end
return M
end
You can simplify it a bit further, without losing performance, like this:
function mat_simpler(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for (i, x) in pairs(v)
M[i, 1], M[i, 2] = x
end
return M
end

A benchmark of the various solutions posted so far...
using BenchmarkTools
# Creating the vector
v = [[i, i+0.1] for i in 0.1:0.2:2000]
M1 = #btime vcat([[e[1] e[2]] for e in $v]...)
M2 = #btime getindex.($v, [1 2])
M3 = #btime [v[i][j] for i in 1:length($v), j in 1:2]
M4 = #btime permutedims(reshape(reduce(hcat, $v), (length($v[1]), length($v))))
M5 = #btime permutedims(reshape(hcat($v...), (length($v[1]), length($v))))
function original(v)
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
return hcat(x,y)
end
function mat(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for i in eachindex(v)
M[i, 1] = v[i][1]
M[i, 2] = v[i][2]
end
return M
end
function mat_simpler(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for (i, x) in pairs(v)
M[i, 1], M[i, 2] = x
end
return M
end
M6 = #btime original($v)
M7 = #btime mat($v)
M8 = #btime mat($v)
M1 == M2 == M3 == M4 == M5 == M6 == M7 == M8 # true
Output:
1.126 ms (10010 allocations: 1.53 MiB) # M1
54.161 μs (3 allocations: 156.42 KiB) # M2
809.000 μs (38983 allocations: 765.50 KiB) # M3
98.935 μs (4 allocations: 312.66 KiB) # M4
244.696 μs (10 allocations: 469.23 KiB) # M5
219.907 μs (30 allocations: 669.61 KiB) # M6
34.311 μs (2 allocations: 156.33 KiB) # M7
34.395 μs (2 allocations: 156.33 KiB) # M8
Note that the dollar sign in the benchmarked code is just to force #btime to consider the vector as a local variable.

Related

Julia pairwise broadcast

I would like to compare every pair of strings in a list of strings in Julia. One way to do it is
equal_strs = [(x == y) for x in str_list, y in str_list]
However, if I use broadcast as follows:
equal_strs = broadcast(==, str_list, str_list)
it returns a vector instead of a 2D array. Is there a way to output a 2D array using broadcast?
Broadcasting works by expanding ("broadcasting") dimensions that do not have the same length, in a way such that an array with (for example) size Nx1xM broadcasted with a NxKx1 gives an NxKxM array.
This means that if you broadcast an operation with to length N vectors, you will get a length N vector.
So you need one string array to be a length N vector, and the other an 1xM matrix:
julia> using Random
julia> str1 = [randstring('A':'C', 3) for _ in 1:5]
5-element Vector{String}:
"ACC"
"CBC"
"AAC"
"CAB"
"BAB"
1.8.0> str2 = [randstring('A':'C', 3) for _ in 1:4]
4-element Vector{String}:
"ABB"
"BAB"
"CAA"
"BBC"
1.8.0> str1 .== permutedims(str2)
5×4 BitMatrix:
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 0 0
permutedims will change a length N vector into a 1xN matrix.
BTW, you would very rarely use broadcast in your code (broadcast(==, a, b)), instead, use the dot syntax, a .== b, which is more idiomatic.
You should have one vector transposed for the broadcasting machinery to build a matrix by expanding dimensions of the inputs to agree.
julia> str_list = ["hello", "car", "me", "good", "people", "good"];
julia> equal_strs = broadcast(==, str_list, permutedims(str_list))
6×6 BitMatrix:
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 1
0 0 0 0 1 0
0 0 0 1 0 1
Also, the following are similar.
equal_strs = str_list .== permutedims(str_list)
equal_strs = isequal.(str_list, permutedims(str_list))
Will assume that by "list" you mean a Vector as there are no python-like lists in Julia. If you meant a tuple I would suggest converting it into a Vector anyway because broadcasting is best used with Arrays (which Vector is a subtype of).
str_list = ["one", "two", "three", "one", "two"]
Now you simply do
broadcast(==, str_list, permutedims(str_list))
or more concise with the dot operator
str_list .== permutedims(str_list)
What happens under the hood:
broadcasting in Julia works element-wise, so if you have 2 Vectors it will not do anything as the dimensions match.
But if you have a Vector and a Matrix (Vector is a 1D Array, and Matrix is a 2D array) with the shapes of (N,1) and (1,N) Julia will broadcast the 1 dimension giving you a Matrix of shape (N,N) which is what you want.
Now usually with numbers you would do ' instead of permutedims
num_list .== num_list'
as to why it doesn't work with strings see this answer.
lst .== permutedims(lst) is a perfectly good method to find the result, as suggested by other answers. But it takes O(n^2) comparisons, and if the list is long, it might be better to use an O(n*log(n)) comparisons algorithm. Following is an implementation of such as algorithm with a little benchmark:
function equal_str(lst)
sp = sortperm(lst)
isp = invperm(sp)
same = [i==1 ? false : lst[sp[i]]==lst[sp[i-1]] for i=1:length(lst)]
ac = accumulate((k,v)-> ifelse(v==false, k+1, k), same; init=0)
return [ ac[isp[i]]==ac[isp[j]] for i=1:length(lst),j=1:length(lst) ]
end
and the benchmark gives:
julia> using Random
julia> using BenchmarkTools
julia> lst = [randstring('A':'C',3) for i=1:40];
julia> show(lst)
["CBA", "CAB", "BCA", "AAC", "AAA", "ABC", "BBA", "CAB", "CBC", "CCA",
"BCC", "BCB", "CAB", "BCB", "ACC", "CBC", "CCC", "CCB", "BCB", "BCB",
"ABA", "AAC", "CCC", "ABC", "BAC", "CAB", "BAB", "BCB", "CCA", "CAC",
"AAA", "BBC", "ABC", "BCB", "CBA", "CAA", "CAB", "CAC", "CBC", "CBC"]
julia> #btime $lst .== permutedims($lst) ;
9.025 μs (5 allocations: 4.58 KiB)
julia> #btime equal_str($lst) ;
6.112 μs (8 allocations: 3.08 KiB)
The larger the lst the bigger the difference would be. This applies only to comparing a list with itself, as the OP suggests. To compare two lists, a different algorithm should be employed for O(n*log(n)) time.
Finally, even this algorithm works a little too hard by sorting, but an O(n^2) time/space complexity is inherent in producing the result.
UPDATE:
A more linear O(n) time calculation (still O(n^2) to make matrix):
function equal_str_2(lst)
d = Dict{String,Int}()
d2 = Dict{Int, Vector{Int}}()
for p in pairs(lst)
if haskey(d,p[2])
push!(d2[d[p[2]]],p[1])
else
d[p[2]] = p[1]
d2[p[1]] = [p[1]]
end
end
res = zeros(Bool, (length(lst), length(lst)))
for p in values(d2)
for q in Iterators.product(p,p)
res[q[1],q[2]] = true
res[q[2], q[1]] = true
end
end
return res
end
and benchmark with larger lst:
julia> lst = [randstring('A':'C',3) for i=1:140];
julia> #btime $lst .== permutedims($lst) ;
99.094 μs (5 allocations: 6.89 KiB)
julia> #btime equal_str($lst) ;
51.981 μs (9 allocations: 23.12 KiB)
julia> #btime equal_str_2($lst) ;
21.539 μs (72 allocations: 27.47 KiB)

How to count matches in two arrays?

If I have two arrays, how can I count the number of matching elements?
E.g. with
x = [1,2,3,4,5]
y = [3,4,5,6]
I'd like to get the count (3) of the three matching elements 3,4,and 5.
You can use intersect:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
3
4
5
6
julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
5
4
3
julia> length(intersect(Set(x), Set(y)))
3
The following algorithm can be near 4X faster than Set intersection. The idea is to sort the arrays first, that has O(n log n) complexity for each array. Then merge-compare the sorted versions for equal elements, that has O(m + n) linear complexity. So, the overall algorithm complexity can be O(n log n).
This algorithm counts duplicate elements into the final matches result, but can be modified with a small overhead to behave similarly to sets. The modification can include adding a variable to keep track of the last matched elements and increment the number of matches only for new different matched pairs.
function count_matches(x,y)
sort!(x) # or x = sort(x)
sort!(y) # or y = sort(y)
i = j = 1
matches = 0
while i <= length(x) && j <= length(y)
if x[i] == y[j]
i += 1
j += 1
matches += 1
elseif x[i] < y[j]
i += 1
else
j += 1
end
end
matches
end
Comparing with:
function count_matches0(x,y)
length(intersect(Set(x), Set(y)))
end
and timing with n = 10000 arrays, we get:
#btime count_matches(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
#btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
246.700 μs (31 allocations: 338.31 KiB)
63.200 μs (2 allocations: 15.88 KiB)
A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers in length, a simple O(N^2) count wins over the count_matches sorting method and the intersect count_matches0 methods above, because of zero allocation setup time:
function count_matches2(x, y)
count(n -> any(==(n), x), y)
end
#btime count_matches(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
#btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
#btime count_matches2(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)
The simplicity advantage vanishes with arrays of size > 1000.

Performance assigning and copying with StaticArrays.jl in Julia

I was thinking of using the package StaticArrays.jl to enhance the performance of my code. However, I only use arrays to store computed variables and use them later after certain conditions are set. Hence, I was benchmarking the type SizedVector in comparison with normal vector, but I do not understand to code below. I also tried StaticVector and used the work around Setfield.jl.
using StaticArrays, BenchmarkTools, Setfield
function copySized(n::Int64)
v = SizedVector{n, Int64}(zeros(n))
w = Vector{Int64}(undef, n)
for i in eachindex(v)
v[i] = i
end
for i in eachindex(v)
w[i] = v[i]
end
end
function copyStatic(n::Int64)
v = #SVector zeros(n)
w = Vector{Int64}(undef, n)
for i in eachindex(v)
#set v[i] = i
end
for i in eachindex(v)
w[i] = v[i]
end
end
function copynormal(n::Int64)
v = zeros(n)
w = Vector{Int64}(undef, n)
for i in eachindex(v)
v[i] = i
end
for i in eachindex(v)
w[i] = v[i]
end
end
n = 10
#btime copySized($n)
#btime copyStatic($n)
#btime copynormal($n)
3.950 μs (42 allocations: 2.08 KiB)
5.417 μs (98 allocations: 4.64 KiB)
78.822 ns (2 allocations: 288 bytes)
Why does the case with SizedVector does have some much more allocations and hence worse performance? Do I not use SizedVector correctly? Should it not at least have the same performance as normal arrays?
Thank you in advance.
Cross post of Julia Discourse
I feel this is apples-to oranges comparison (and size should be store in statically in type). More illustrative code could look like this:
function copySized(::Val{n}) where n
v = SizedVector{n}(1:n)
w = Vector{Int64}(undef, n)
w .= v
end
function copyStatic(::Val{n}) where n
v = SVector{n}(1:n)
w = Vector{Int64}(undef, n)
w .= v
end
function copynormal(n)
v = [1:n;]
w = Vector{Int64}(undef, n)
w .= v
end
And now benchamrks:
julia> n = 10
10
julia> #btime copySized(Val{$n}());
248.138 ns (1 allocation: 144 bytes)
julia> #btime copyStatic(Val{$n}());
251.507 ns (1 allocation: 144 bytes)
julia> #btime copynormal($n);
77.940 ns (2 allocations: 288 bytes)
julia>
julia>
julia> n = 1000
1000
julia> #btime copySized(Val{$n}());
840.000 ns (2 allocations: 7.95 KiB)
julia> #btime copyStatic(Val{$n}());
830.769 ns (2 allocations: 7.95 KiB)
julia> #btime copynormal($n);
1.100 μs (2 allocations: 15.88 KiB)
#phipsgabler is right! Statically sized arrays have their performance advantages when the size is known statically, at compile time. My arrays are, however, dynamically sized, with the size n being a runtime variable.
Changing this yields more sensible results:
using StaticArrays, BenchmarkTools, Setfield
function copySized()
v = SizedVector{10, Float64}(zeros(10))
w = Vector{Float64}(undef, 10*2)
for i in eachindex(v)
v[i] = rand()
end
for i in eachindex(v)
j = i+floor(Int64, 10/4)
w[j] = v[i]
end
end
function copyStatic()
v = #SVector zeros(10)
w = Vector{Int64}(undef, 10*2)
for i in eachindex(v)
#set v[i] = rand()
end
for i in eachindex(v)
j = i+floor(Int64, 10/4)
w[j] = v[i]
end
end
function copynormal()
v = zeros(10)
w = Vector{Float64}(undef, 10*2)
for i in eachindex(v)
v[i] = rand()
end
for i in eachindex(v)
j = i+floor(Int64, 10/4)
w[j] = v[i]
end
end
#btime copySized()
#btime copyStatic()
#btime copynormal()
110.162 ns (3 allocations: 512 bytes)
48.133 ns (1 allocation: 224 bytes)
92.045 ns (2 allocations: 368 bytes)

Skip every nth element of array

How can I remove every nth element from an array in julia? Let's say I have the following array: a = [1 2 3 4 5 6] and I want b = [1 2 4 5]
In javascript I would do something like:
b = a.filter(e => e % 3);
How can it be done in Julia?
Your question title and text ask different questions. The title asks how to skip the Nth element, whereas the Javascript code snippet details how to skip elements based on their value, not their index.
Skipping by Value
We can do this using filter.
filter((x) -> x % 3 != 0, a)
This is basically equivalent to your Javascript code. We can, incidentally, also use broadcasting.
a[a .% 3 .!= 0]
This is more akin to code you would see in array-oriented languages like MATLAB and R.
Skipping by Index
With an extra enumerate call, we can get the indices to operate on.
map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate(a)))
This is roughly what you'd do in Python. enumerate to get the indices, filter to purge, then map to eliminate the now-unnecessary indices.
Or we can, again, use broadcasting.
a[(1:length(a)) .% 3 .!= 0]
If you need skipping by index the most elegant way is to use InvertedIndices
julia> using InvertedIndices # or using DataFrames
julia> a[Not(3:3:end)]
4-element Vector{Int64}:
1
2
4
5
As you can see all your job here is to provide a range of indices you wish to skip.
If you want to filter by the index, one convenient way is using a comprehension:
julia> a = 10:10:100;
julia> [a[i] for i in eachindex(a) if i % 3 != 0] |> permutedims
1×7 Matrix{Int64}:
10 20 40 50 70 80 100
julia> vec(ans) == [a[1 + 3(j-1)÷2] for j in 1:7]
true
This implicitly involves Iterators.filter, and collects the generator. You can also use this to filter by value, although the eager filter is probably more efficient:
julia> a = 1:10;
julia> [x for x in a if x%3!=0] |> permutedims
1×7 Matrix{Int64}:
1 2 4 5 7 8 10
Perhaps it's interesting to time all of these:
julia> using BenchmarkTools, InvertedIndices
julia> a = rand(1000); # filter by index
julia> i1 = #btime [$a[1 + 3(j-1)÷2] for j in 1:667];
373.162 ns (1 allocation: 5.38 KiB)
julia> i2 = #btime $a[eachindex($a) .% 3 .!= 0];
1.387 μs (4 allocations: 9.80 KiB)
julia> i3 = #btime [$a[i] for i in eachindex($a) if i % 3 != 0];
3.557 μs (11 allocations: 16.47 KiB)
julia> i4 = #btime map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate($a)));
4.202 μs (11 allocations: 16.47 KiB)
julia> i5 = #btime $a[Not(3:3:end)];
84.333 μs (4655 allocations: 182.28 KiB)
julia> i1 == i2 == i3 == i4 == i5
true
julia> a = rand(1:99, 1000); # filter by value
julia> v1 = #btime filter(x -> x%3!=0, $a);
532.185 ns (1 allocation: 7.94 KiB)
julia> v2 = #btime [x for x in $a if x%3!=0];
5.465 μs (11 allocations: 16.47 KiB)
julia> v1 == v2
true
This should help you:
b = a[Bool[i %3 != 0 for i = 1:length(a)]]
a[a .% 2 .!= 0]
please find the link with code.

MATLAB-style replacement of array values that meet certain condition in Julia [duplicate]

In Octave, I can do
octave:1> A = [1 2; 3 4]
A =
1 2
3 4
octave:2> A(A>1) -= 1
A =
1 1
2 3
but in Julia, the equivalent syntax does not work.
julia> A = [1 2; 3 4]
2x2 Array{Int64,2}:
1 2
3 4
julia> A[A>1] -= 1
ERROR: `isless` has no method matching isless(::Int64, ::Array{Int64,2})
in > at operators.jl:33
How do you conditionally assign values to certain array or matrix elements in Julia?
Your problem isn't with the assignment, per se, it's that A > 1 itself doesn't work. You can use the elementwise A .> 1 instead:
julia> A = [1 2; 3 4];
julia> A .> 1
2×2 BitArray{2}:
false true
true true
julia> A[A .> 1] .-= 1000;
julia> A
2×2 Array{Int64,2}:
1 -998
-997 -996
Update:
Note that in modern Julia (>= 0.7), we need to use . to say that we want to broadcast the action (here, subtracting by the scalar 1000) to match the size of the filtered target on the left. (At the time this question was originally asked, we needed the dot in A .> 1 but not in .-=.)
In Julia v1.0 you can use the replace! function instead of logical indexing, with considerable speedups:
julia> B = rand(0:20, 8, 2);
julia> #btime (A[A .> 10] .= 10) setup=(A=copy($B))
595.784 ns (11 allocations: 4.61 KiB)
julia> #btime replace!(x -> x>10 ? 10 : x, A) setup=(A=copy($B))
13.530 ns ns (0 allocations: 0 bytes)
For larger matrices, the difference hovers around 10x speedup.
The reason for the speedup is that the logical indexing solution relies on creating an intermediate array, while replace! avoids this.
A slightly terser way of writing it is
replace!(x -> min(x, 10), A)
There doesn't seem to be any speedup using min, though.
And here's another solution that is almost as fast:
A .= min.(A, 10)
and that also avoids allocations.
To make it work in Julia 1.0 one need to change = to .=. In other words:
julia> a = [1 2 3 4]
julia> a[a .> 1] .= 1
julia> a
1×4 Array{Int64,2}:
1 1 1 1
Otherwise you will get something like
ERROR: MethodError: no method matching setindex_shape_check(::Int64, ::Int64)

Resources