Apply function to pairs of columns in Julia - arrays

I have a pair of matrices, say Ws and Xs, of equal dimensions and a function myFunc(w, x) which takes two vectors as input. I want to apply this function to corresponding pairs of columns (think of it as zipping the columns and mapping the function over the pairs).
Is there a non-iterative way to do this? If there were only two columns in each of Ws and Xs, I could do
allCols = permutedims(reshape(hcat(Ws, Xs), d, 2), [1, 3, 2])
mapslices(x -> myFunc(x[:, 1], x[:, 2]), allCols, dims=[1, 2])
but I'm having trouble moving to an arbitrary number of columns.
Edit: using vcat and the correct dimensions seems to fix this:
# assume d is column size
wxArray = reshape(vcat(Ws, Xs), 2, d) # group pairs of columns together
mapslices(x -> myFunc(x[:, 1], x[:, 2]), wxArray, dims=[1,2])

You can use the eachcol function like this (I give three ways just to show different possible approaches, but eachcol is crucial in all of them):
julia> Ws = rand(2,3)
2×3 Array{Float64,2}:
0.164036 0.233236 0.937968
0.724233 0.102248 0.55047
julia> Xs = rand(2,3)
2×3 Array{Float64,2}:
0.0493071 0.735849 0.643352
0.909295 0.276808 0.396145
julia> using LinearAlgebra
julia> dot.(eachcol(Ws), eachcol(Xs))
3-element Array{Float64,1}:
0.6666296397421881
0.19992972241709792
0.8215096642236619
julia> dot.(eachcol.((Ws, Xs))...)
3-element Array{Float64,1}:
0.6666296397421881
0.19992972241709792
0.8215096642236619
julia> map(dot, eachcol(Ws), eachcol(Xs))
3-element Array{Float64,1}:
0.6666296397421881
0.19992972241709792
0.8215096642236619
This requires Julia 1.1 or later, since that is when eachcol was introduced.
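The examples above use dot, but the same pattern should carry over to any two-argument function. Here is a minimal sketch, assuming a made-up myFunc (the question does not say what it computes) and small hand-written matrices:

```julia
using LinearAlgebra

# Hypothetical stand-in for the question's myFunc(w, x); any function of two vectors works.
myFunc(w, x) = dot(w, x) + norm(w - x)

Ws = [1.0 2.0 3.0; 4.0 5.0 6.0]
Xs = [6.0 5.0 4.0; 3.0 2.0 1.0]

# eachcol yields views of the columns, so the columns are not copied.
results = map(myFunc, eachcol(Ws), eachcol(Xs))

# The same computation written as a comprehension over zipped columns.
results_zip = [myFunc(w, x) for (w, x) in zip(eachcol(Ws), eachcol(Xs))]
```

Both forms return one value per column pair; which one to use is mostly a matter of taste.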
EDIT
If you are on Julia 1.0 and want to avoid iteration, and do not mind some extra allocations (the solution above avoids them), you can also use the cat function (which I think is a bit simpler than your approach):
julia> Ws = rand(2,3)
2×3 Array{Float64,2}:
0.975749 0.660932 0.391192
0.619872 0.278402 0.799096
julia> Xs = rand(2,3)
2×3 Array{Float64,2}:
0.0326003 0.272455 0.713046
0.389058 0.886105 0.950822
julia> mapslices(x -> (x[:,1], x[:,2]), cat(Ws, Xs; dims=3), dims=[1,3])[1,:,1]
3-element Array{Tuple{Array{Float64,1},Array{Float64,1}},1}:
([0.975749, 0.619872], [0.0326003, 0.389058])
([0.660932, 0.278402], [0.272455, 0.886105])
([0.391192, 0.799096], [0.713046, 0.950822])
Of course, you can also simply do this:
julia> map(i -> (Ws[:,i], Xs[:,i]), axes(Ws, 2))
3-element Array{Tuple{Array{Float64,1},Array{Float64,1}},1}:
([0.975749, 0.619872], [0.0326003, 0.389058])
([0.660932, 0.278402], [0.272455, 0.886105])
([0.391192, 0.799096], [0.713046, 0.950822])
Or, if you want something fancier:
julia> (i -> (Ws[:,i], Xs[:,i])).(axes(Ws, 2))
3-element Array{Tuple{Array{Float64,1},Array{Float64,1}},1}:
([0.975749, 0.619872], [0.0326003, 0.389058])
([0.660932, 0.278402], [0.272455, 0.886105])
([0.391192, 0.799096], [0.713046, 0.950822])
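The index-based variants above copy each column, because Ws[:,i] allocates a new vector. If that matters, view should also be available on Julia 1.0. A small sketch, again with a hypothetical myFunc as a placeholder:

```julia
# Hypothetical placeholder for the question's myFunc(w, x).
myFunc(w, x) = sum(w .* x)

Ws = [1.0 2.0 3.0; 4.0 5.0 6.0]
Xs = [6.0 5.0 4.0; 3.0 2.0 1.0]

# view(A, :, i) refers to column i without copying it, unlike A[:, i].
results = map(i -> myFunc(view(Ws, :, i), view(Xs, :, i)), axes(Ws, 2))
```

This applies myFunc directly instead of first building tuples of columns, which is usually what the question is after.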

Related

Iterate over vector of vectors of Strings without using for loops in Julia

Given a vector of vectors of strings, like:
sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
["Julia", "reads", "beautiful!"],
["Python", "has", "600", "times", "more", "libraries"]
]
I'm trying to filter out some tokens in each of them, without losing the outer vector structure (i.e., without flattening the vector down to a single list of tokens).
So far I've achieved this using a classic for loop:
number_of_alphabetical_tokens = []
number_of_long_words = []
total_tokens = []
for sent in sentences
append!(number_of_alphabetical_tokens, length([token for token in sent if all(isletter, token)]))
append!(number_of_long_words, length([token for token in sent if length(token) > 2]))
append!(total_tokens, length(sent))
end
collect(zip(number_of_alphabetical_tokens, number_of_long_words, total_tokens))
Output (edited as per @shayan's observation):
3-element Vector{Tuple{Any, Any, Any}}:
(4, 5, 6)
(2, 3, 3)
(5, 6, 6)
This gets the job done, but it takes more time than I'd like (I have 6000+ documents, with thousands of sentences each...), and it looks a bit like an antipattern.
Is there a way of doing this with comprehensions or broadcasting (or any more performant method)?
There's no reason to avoid loops for performance reasons in Julia. Loops are fast, and vectorized code is just loops in disguise.
Here's an example of doing this with loops, and some reductions, like all and count:
function wordstats(sentences)
out = Vector{NTuple{3, Int}}(undef, length(sentences))
for (i, sent) in pairs(sentences)
a = count(all(isletter, word) for word in sent)
b = count(length(word)>2 for word in sent)
c = length(sent)
out[i] = (a, b, c)
end
return out
end
The above code is not optimized, for example, counting words longer than 2 can be improved, but it runs in approximately 700ns on my laptop, which is much faster than the vectorized solution.
Edit: Here's basically the same code, but using the map do syntax (so you don't have to figure out the return type):
function wordstats2(sentences)
map(sentences) do sent
a = count(all(isletter, word) for word in sent)
b = count(length(word)>2 for word in sent)
c = length(sent)
return (a, b, c)
end
end
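As one possible tightening (not necessarily the optimization the answer had in mind), both counts can be accumulated in a single pass over each sentence. The name wordstats_onepass is made up for this sketch:

```julia
# Hypothetical single-pass variant: one loop per sentence instead of two count calls.
function wordstats_onepass(sentences)
    map(sentences) do sent
        nalpha = 0   # tokens made only of letters
        nlong = 0    # tokens longer than 2 characters
        for word in sent
            nalpha += all(isletter, word)   # Bool adds as 0 or 1
            nlong += length(word) > 2
        end
        (nalpha, nlong, length(sent))
    end
end

sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

stats = wordstats_onepass(sentences)
```

Whether this is measurably faster would need benchmarking on real data; it mainly shows that the loop body can do several things at once.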
First, I think there is a mistake in the expected results you wrote; for example, you gave 7 as the total number of tokens in the first sentence, while it should actually be 6.
You can follow such a procedure, fully vectorized:
julia> sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
["Julia", "reads", "beautiful!"],
["Python", "has", "600", "times", "more", "libraries"]
];
julia> function check_all_letter(str::String)
all(isletter, str)
end
check_all_letter (generic function with 1 method)
julia> all_letters = map(x->filter(y->check_all_letter.(y), x), sentences)
3-element Vector{Vector{String}}:
["Julia", "is", "faster", "than"]
["Julia", "reads"]
["Python", "has", "times", "more", "libraries"]
julia> length.(all_letters)
3-element Vector{Int64}:
4
2
5
I can make a similar procedure for number_of_long_words and total_tokens. Wrapping all of it in a function, I'll have:
julia> function arbitrary_name(vec::Vector{Vector{String}})
           # use the argument vec rather than the global sentences
           all_letters = map(x->filter(check_all_letter, x), vec)
           long_words = map(x->filter(y->length.(y).>2, x), vec)
           total_tokens = length.(vec)
           return collect(zip( length.(all_letters),
                               length.(long_words),
                               total_tokens
                             )
                         )
       end
arbitrary_name (generic function with 1 method)
julia> arbitrary_name(sentences)
3-element Vector{Tuple{Int64, Int64, Int64}}:
(4, 5, 6)
(2, 3, 3)
(5, 6, 6)
Additional explanation
When I write something like length.(y).>2, I am chaining Julia functions through broadcasting. Consider this example to see what happens in length.(y).>2:
julia> vec = ["foo", "bar", "baz"];
julia> lengths = length.(vec)
3-element Vector{Int64}:
3
3
3
julia> more_than_two = lengths .> 2
3-element BitVector:
1
1
1
# This is exactly equivalent to:
julia> length.(vec).>2
3-element BitVector:
1
1
1
# Or, with broadcast piping:
julia> vec .|> length .|> (x -> x > 2)
3-element BitVector:
1
1
1
I hope this helps, @fandak. See the official documentation for further explanation of broadcasting and function chaining.
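If only the counts are needed, the intermediate filtered vectors can be skipped by calling count with a predicate. This is a sketch of that variation, not the code from the answer above:

```julia
sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

# count(pred, itr) counts the elements for which pred is true, without building a new vector.
n_alpha = map(s -> count(w -> all(isletter, w), s), sentences)
n_long  = map(s -> count(w -> length(w) > 2, s), sentences)
n_total = length.(sentences)

stats = collect(zip(n_alpha, n_long, n_total))
```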

How to export/import an array in Julia?

I want to move an array from my laptop (Julia 1.3.1) to my desktop PC (Julia 1.6.2).
I make an array in Julia 1.3.1 as follows.
using LinearAlgebra
H = ... #give a matrix H
vals, vector = eigen(H) # do not name a variable eigen: that would clash with the function
Then, I'd like to move "vector" to Julia 1.6.2.
How do you do that?
The simplest way is by using DelimitedFiles:
julia> v = [1.0,2.0,3.0]
julia> using DelimitedFiles
julia> writedlm("f.txt", v)
julia> readdlm("f.txt")
3×1 Matrix{Float64}:
1.0
2.0
3.0
julia> vec(readdlm("f.txt"))
3-element Vector{Float64}:
1.0
2.0
3.0
Note that DelimitedFiles works with matrices, so the last example shows what to do if you would rather store a vector.
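Applied to the eigenvector matrix from the question, the round trip might look like the sketch below. The matrix H and the file name are assumptions for illustration; in practice the file would be copied from one machine to the other between the write and the read.

```julia
using LinearAlgebra, DelimitedFiles

H = [2.0 1.0; 1.0 3.0]        # small symmetric example matrix (assumed)
vals, vecs = eigen(H)         # eigenvalues and eigenvector matrix

path = joinpath(mktempdir(), "vectors.txt")   # hypothetical file location
writedlm(path, vecs)          # on the source machine
vecs2 = readdlm(path)         # on the destination machine
```

For a real symmetric H the eigenvectors are real, so readdlm needs no extra type argument here.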
Edit following Bogumil's comment
When you have a Matrix of Complex numbers you need to provide the output type for readdlm:
julia> v = Complex.(rand(2,3), rand(2,3))
2×3 Matrix{ComplexF64}:
0.282157+0.540556im 0.757765+0.103518im 0.979935+0.212347im
0.557499+0.934859im 0.604032+0.338489im 0.431962+0.945946im
julia> writedlm("f.txt", v)
julia> readdlm("f.txt",'\t',Complex{Float64})
2×3 Matrix{ComplexF64}:
0.282157+0.540556im 0.757765+0.103518im 0.979935+0.212347im
0.557499+0.934859im 0.604032+0.338489im 0.431962+0.945946im
julia> readdlm("f.txt",'\t',Complex{Float64}) == v
true
Another way is to use a binary format. For long-term serialization that has to work across Julia versions, BSON (binary JSON) could be a good option:
julia> using BSON
julia> BSON.bson("v.bson", v = v)
julia> v2 = BSON.load("v.bson")[:v]
2×3 Matrix{ComplexF64}:
0.282157+0.540556im 0.757765+0.103518im 0.979935+0.212347im
0.557499+0.934859im 0.604032+0.338489im 0.431962+0.945946im

Julia: rationale behind array size and index for "extra" dimensions?

I am using Julia from time to time, however I am surprised by the following behavior:
Let's define a 3×4 array
julia> m=rand(3,4)
3×4 Array{Float64,2}:
0.889018 0.500847 0.539856 0.828231
0.492425 0.582958 0.521406 0.754102
0.28227 0.834333 0.669967 0.0939701
Now I check that
julia> size(m,1), size(m,2)
(3, 4)
as expected.
However, I am surprised by this:
julia> size(m,3), size(m,2018)
(1, 1)
-> I would have expected (0,0) or an error message
Looking at the Julia code confirms this behavior:
size(t::AbstractArray{T,N}, d) where {T,N} = d <= N ? size(t)[d] : 1
Moreover:
julia> m[2,1,1,1,1]
0.4924252391289974
-> I would have expected an out of bounds error
So my question is: "what is the rationale?"
(I do not think it is a bug; I use Julia version 0.6.2.)
I believe it's for broadcasting.
julia> m=rand(3,4)
3×4 Array{Float64,2}:
0.139323 0.663912 0.994985 0.517332
0.423913 0.121753 0.0327054 0.0754665
0.392672 0.47006 0.351121 0.787318
julia> size(m)
(3, 4)
julia> n = rand(3)
3-element Array{Float64,1}:
0.716752
0.98755
0.661226
julia> m .* n
3×4 Array{Float64,2}:
0.09986 0.475861 0.713157 0.370799
0.418636 0.120237 0.0322983 0.074527
0.259645 0.310816 0.23217 0.520595
Notice that n has one dimension fewer, so its size is 1 in the 2nd dimension, and thus it is applied column-wise. Scalars in broadcast are treated differently and are generally inlined into the fused broadcasting function, which you cannot do with a mutable type, so the rule that size 1 means "expand in higher dimensions" is a nice way to implement this.
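The behaviour discussed in this thread can be checked directly. A short sketch with hand-written values (the numbers are arbitrary):

```julia
m = [1.0 2.0 3.0 4.0; 5.0 6.0 7.0 8.0; 9.0 10.0 11.0 12.0]
n = [10.0, 20.0, 30.0]

extra_size = size(m, 3)          # dimensions beyond ndims(m) report size 1
trailing = m[2, 1, 1, 1]         # trailing indices equal to 1 are accepted
scaled = m .* n                  # n behaves like a 3×1 column in the broadcast
scaled_explicit = m .* reshape(n, 3, 1)

# A trailing index other than 1 is still out of bounds.
out_of_bounds = try
    m[2, 1, 2]
    false
catch err
    err isa BoundsError
end
```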

Override Show Overload Due to Subtyping

I have
type MyArray{T,N} <: AbstractArray{T,N}
x::Array{T,N}
y::Int
end
It prints like an array. However, I would like its show/print/display/Juno render to act like it's just any ol' type. Is there a good way to remove the overrides without dropping the AbstractArray subtyping?
Here's a way to restore the standard Base.show behavior for a type, using invoke:
julia> type MyArray{T,N} <: AbstractArray{T,N}
x::Array{T,N}
y::Int
end
julia> Base.show(io::IO, A::MyArray) =
invoke(show, Tuple{typeof(io), Any}, io, A)
julia> Base.show(io::IO, ::MIME"text/plain", A::MyArray) = show(io, A)
julia> MyArray([1, 2, 3], 4)
MyArray{Int64,1}([1, 2, 3], 4)
I don't know if this handles the Juno part; apparently Juno uses its own infrastructure.
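The transcript above uses pre-1.0 syntax (type was replaced by struct and mutable struct). A sketch of the same invoke idea in current syntax, assuming Julia 1.x and leaving the rest of the AbstractArray interface undefined as in the question:

```julia
struct MyArray{T,N} <: AbstractArray{T,N}
    x::Array{T,N}
    y::Int
end

# Skip the AbstractArray show methods and call the generic fallback for Any.
Base.show(io::IO, A::MyArray) = invoke(show, Tuple{IO, Any}, io, A)
# The REPL uses the three-argument form, so forward it to the two-argument one.
Base.show(io::IO, ::MIME"text/plain", A::MyArray) = show(io, A)

A = MyArray([1, 2, 3], 4)
shown = sprint(show, A)
shown_repl = sprint(show, MIME("text/plain"), A)
```

The printed type name may be formatted slightly differently between Julia versions, but the field list should follow the default struct layout.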
Not sure how one might generically 'restore' the default show function, but this is easy enough to emulate:
julia> type MyArray{T,N} <: AbstractArray{T,N}
x::Array{T,N}
y::Int
end
julia> Base.show(io::IO, a::MyArray) = print(io, "$(typeof(a))($(a.x), $(a.y))");
julia> Base.show(io::IO, ::MIME"text/plain", a::MyArray) = show(io, a);
julia> a = MyArray([1., 2., 3., 4., 5.], 5)
MyArray{Float64,1}([1.0, 2.0, 3.0, 4.0, 5.0], 5)
As an aside, personally I find dump to work a lot better as a multi-line 'display' function than the default one for such an array-containing type:
julia> Base.show(io::IO, ::MIME"text/plain", a::MyArray) = dump(a);
julia> a
MyArray{Float64,1}
x: Array{Float64}((5,)) [1.0, 2.0, 3.0, 4.0, 5.0]
y: Int64 5

Fortran-like arrays such as FArray(Float64, -1:1,-7:7,-128:512) in Julia

Having 1-based arrays in Julia is generally a good decision, but sometimes it is desirable to have a Fortran-like array with indices that span some subrange of ℤ:
julia> x = FArray(Float64, -1:1,-7:7,-128:512)
where it would be useful:
in the codes accompanying the book Numerical Solution of Hyperbolic Partial Differential Equations by prof. John A. Trangenstein these negative indices are used intensively for ghost cells for boundary conditions.
The same is true for Clawpack (stands for “Conservation Laws Package”) by prof. Randall J. LeVeque http://depts.washington.edu/clawpack/ and there are many other codes where such indices would be natural.
So such an auxiliary class would be useful for quickly translating such codes.
I just started to implement such an auxiliary type but as I'm quite new to Julia your help would be greatly appreciated.
I started with:
type FArray
ranges
array::Array
function FArray(T, r::Range1{Int}...)
dims = map((x) -> length(x), r)
array = Array(T, dims)
new(r, array)
end
end
Output:
julia> include("FortranArray.jl")
julia> x = FArray(Float64, -1:1,-7:7,-128:512)
FArray((-1:1,-7:7,-128:512),3x15x641 Array{Float64,3}:
[:, :, 1] =
6.90321e-310 2.6821e-316 1.96042e-316 0.0 0.0 0.0 9.84474e-317 … 1.83233e-316 2.63285e-316 0.0 9.61618e-317 0.0
6.90321e-310 6.32404e-322 2.63285e-316 0.0 0.0 0.0 2.63292e-316 2.67975e-316
...
[:, :, 2] =
...
As I'm completely new to Julia, any recommendations, especially ones that lead to more efficient code, would be greatly appreciated.
[Edit]
The topic has been discussed here:
https://groups.google.com/forum/#!topic/julia-dev/NOF6MA6tb9Y
During the discussion two ways to have Julia arrays with arbitrary base were elaborated:
The first is SubArray-based; sample usage with a helper function:
function farray(T, r::Range1{Int64}...)
dims = map((x) -> length(x), r)
array = Array(T, dims)
sub_indices = map((x) -> -minimum(x) + 2 : maximum(x), r)
sub(array, sub_indices)
end
julia> y = farray(Float64, -1:1,-7:7,-128:512);
julia> y[-1,-7,-128] = 777
777
julia> y[-1,-7,-128] + 33
810.0
julia> y[-2,-7,-128]
ERROR: BoundsError()
in getindex at subarray.jl:200
julia> y[2,-7,-128]
2.3977385e-316
Please note that bounds are not fully checked; more details are here:
https://github.com/JuliaLang/julia/issues/4044
At the moment SubArray has performance issues, but its performance might eventually be improved; see also:
https://github.com/JuliaLang/julia/issues/5117
https://github.com/JuliaLang/julia/issues/3496
The second approach has better performance at the moment and also checks both bounds:
type FArray{T<:Number, N, A<:AbstractArray} <: AbstractArray
ranges
offsets::NTuple{N,Int}
array::A
function FArray(r::Range1{Int}...)
dims = map((x) -> length(x), r)
array = Array(T, dims)
offs = map((x) -> 1 - minimum(x), r)
new(r, offs, array)
end
end
FArray(T, r::Range1{Int}...) = FArray{T, length(r,), Array{T, length(r,)}}(r...)
getindex{T<:Number}(FA::FArray{T}, i1::Int) = FA.array[i1+FA.offsets[1]]
getindex{T<:Number}(FA::FArray{T}, i1::Int, i2::Int) = FA.array[i1+FA.offsets[1], i2+FA.offsets[2]]
getindex{T<:Number}(FA::FArray{T}, i1::Int, i2::Int, i3::Int) = FA.array[i1+FA.offsets[1], i2+FA.offsets[2], i3+FA.offsets[3]]
setindex!{T}(FA::FArray{T}, x, i1::Int) = arrayset(FA.array, convert(T,x), i1+FA.offsets[1])
setindex!{T}(FA::FArray{T}, x, i1::Int, i2::Int) = arrayset(FA.array, convert(T,x), i1+FA.offsets[1], i2+FA.offsets[2])
setindex!{T}(FA::FArray{T}, x, i1::Int, i2::Int, i3::Int) = arrayset(FA.array, convert(T,x), i1+FA.offsets[1], i2+FA.offsets[2], i3+FA.offsets[3])
getindex and setindex! methods for FArray were inspired by base/array.jl code.
Use cases:
julia> y = FArray(Float64, -1:1,-7:7,-128:512);
julia> y[-1,-7,-128] = 777
777
julia> y[-1,-7,-128] + 33
810.0
julia> y[-1,2,3]
0.0
julia> y[-2,-7,-128]
ERROR: BoundsError()
in getindex at FortranArray.jl:27
julia> y[2,-7,-128]
ERROR: BoundsError()
in getindex at FortranArray.jl:27
There are now two packages that provide this kind of functionality. For arrays with arbitrary start indices, see https://github.com/alsam/OffsetArrays.jl. For even more flexibility see https://github.com/mbauman/AxisArrays.jl, where indices do not have to be integers.
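The FArray code in this thread uses pre-1.0 syntax (type, Range1, Array(T, dims), getindex{T}), which no longer parses. For readers who only want to see the offset idea in current syntax, here is a minimal sketch. The helper storage_index is made up for this example, the storage is zero-initialized (the original left it uninitialized), and the type deliberately does not subtype AbstractArray to keep the sketch short. For real work the packages mentioned above are the better choice.

```julia
# Hypothetical modern re-sketch of the FArray idea, not the original author's code.
struct FArray{T,N}
    ranges::NTuple{N,UnitRange{Int}}   # user-facing index range of each dimension
    data::Array{T,N}                   # ordinary 1-based storage
end

# Build zero-filled storage sized from the lengths of the ranges.
FArray(::Type{T}, ranges::Vararg{UnitRange{Int},N}) where {T,N} =
    FArray{T,N}(ranges, zeros(T, map(length, ranges)))

# Translate user indices to 1-based storage indices, checking both bounds.
function storage_index(A::FArray{T,N}, I::NTuple{N,Int}) where {T,N}
    all(map(in, I, A.ranges)) || throw(BoundsError(A.data, I))
    return map((i, r) -> i - first(r) + 1, I, A.ranges)
end

Base.getindex(A::FArray{T,N}, I::Vararg{Int,N}) where {T,N} =
    A.data[storage_index(A, I)...]
Base.setindex!(A::FArray{T,N}, v, I::Vararg{Int,N}) where {T,N} =
    (A.data[storage_index(A, I)...] = v)

y = FArray(Float64, -1:1, -7:7, -128:512)
y[-1, -7, -128] = 777
total = y[-1, -7, -128] + 33
```

Indexing outside either end of a range, such as y[2, -7, -128] or y[-2, -7, -128], raises a BoundsError in this sketch.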

Resources