Julia pairwise broadcast - arrays

I would like to compare every pair of strings in a list of strings in Julia. One way to do it is
equal_strs = [(x == y) for x in str_list, y in str_list]
However, if I use broadcast as follows:
equal_strs = broadcast(==, str_list, str_list)
it returns a vector instead of a 2D array. Is there a way to output a 2D array using broadcast?

Broadcasting works by expanding ("broadcasting") dimensions that do not have the same length, in a way such that an array with (for example) size Nx1xM broadcasted with a NxKx1 gives an NxKxM array.
This means that if you broadcast an operation over two length-N vectors, you will get a length-N vector.
So you need one string array to be a length N vector, and the other an 1xM matrix:
julia> using Random
julia> str1 = [randstring('A':'C', 3) for _ in 1:5]
5-element Vector{String}:
"ACC"
"CBC"
"AAC"
"CAB"
"BAB"
julia> str2 = [randstring('A':'C', 3) for _ in 1:4]
4-element Vector{String}:
"ABB"
"BAB"
"CAA"
"BBC"
julia> str1 .== permutedims(str2)
5×4 BitMatrix:
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 0 0
permutedims will change a length N vector into a 1xN matrix.
BTW, you would very rarely use broadcast in your code (broadcast(==, a, b)), instead, use the dot syntax, a .== b, which is more idiomatic.
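The shape rule described above can be checked directly with size. This is just a small sketch with made-up arrays (a, b and v are not from the question):

```julia
# Singleton dimensions are stretched to match the other argument.
a = ones(3, 1, 4)
b = ones(3, 5, 1)
shape3d = size(a .+ b)                     # 3×1×4 against 3×5×1

v = ["x", "y", "x"]
same_shape = size(v .== v)                 # two length-3 vectors, nothing to expand
pair_shape = size(v .== permutedims(v))    # length-3 vector against a 1×3 matrix
```

Only the last call produces the pairwise matrix, because only there do the two arguments differ in which dimension has length 1.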

You should have one vector transposed for the broadcasting machinery to build a matrix by expanding dimensions of the inputs to agree.
julia> str_list = ["hello", "car", "me", "good", "people", "good"];
julia> equal_strs = broadcast(==, str_list, permutedims(str_list))
6×6 BitMatrix:
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 1
0 0 0 0 1 0
0 0 0 1 0 1
Also, the following are similar.
equal_strs = str_list .== permutedims(str_list)
equal_strs = isequal.(str_list, permutedims(str_list))

I will assume that by "list" you mean a Vector, as there are no Python-like lists in Julia. If you meant a tuple, I would suggest converting it into a Vector anyway, because broadcasting is best used with Arrays (which Vector is a subtype of).
str_list = ["one", "two", "three", "one", "two"]
Now you simply do
broadcast(==, str_list, permutedims(str_list))
or more concise with the dot operator
str_list .== permutedims(str_list)
What happens under the hood:
Broadcasting in Julia works element-wise, so if you have 2 Vectors of the same length nothing gets expanded: the dimensions already match and you get a Vector back.
But if you have a Vector and a one-row Matrix (Vector is a 1D Array, and Matrix is a 2D Array), with shapes that are treated as (N,1) and (1,N), Julia will broadcast along the dimensions of length 1, giving you a Matrix of shape (N,N), which is what you want.
Now usually with numbers you would do ' instead of permutedims
num_list .== num_list'
as to why it doesn't work with strings see this answer.
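As far as I understand it, ' is the adjoint, which is also applied to every element, and strings do not define an adjoint, while permutedims only rearranges the container. A small sketch (the example values are mine) showing that for numbers both give the same comparison matrix, and that permutedims is fine for strings:

```julia
num_list = [1, 2, 2, 3]
with_adjoint = num_list .== num_list'
with_permutedims = num_list .== permutedims(num_list)

str_list = ["a", "b", "a"]
str_pairs = str_list .== permutedims(str_list)   # 3×3 BitMatrix
```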

lst .== permutedims(lst) is a perfectly good method to find the result, as suggested by other answers. But it takes O(n^2) string comparisons, and if the list is long, it might be better to use an algorithm that needs only O(n*log(n)) comparisons. Following is an implementation of such an algorithm with a little benchmark:
function equal_str(lst)
sp = sortperm(lst)
isp = invperm(sp)
same = [i==1 ? false : lst[sp[i]]==lst[sp[i-1]] for i=1:length(lst)]
ac = accumulate((k,v)-> ifelse(v==false, k+1, k), same; init=0)
return [ ac[isp[i]]==ac[isp[j]] for i=1:length(lst),j=1:length(lst) ]
end
and the benchmark gives:
julia> using Random
julia> using BenchmarkTools
julia> lst = [randstring('A':'C',3) for i=1:40];
julia> show(lst)
["CBA", "CAB", "BCA", "AAC", "AAA", "ABC", "BBA", "CAB", "CBC", "CCA",
"BCC", "BCB", "CAB", "BCB", "ACC", "CBC", "CCC", "CCB", "BCB", "BCB",
"ABA", "AAC", "CCC", "ABC", "BAC", "CAB", "BAB", "BCB", "CCA", "CAC",
"AAA", "BBC", "ABC", "BCB", "CBA", "CAA", "CAB", "CAC", "CBC", "CBC"]
julia> @btime $lst .== permutedims($lst) ;
9.025 μs (5 allocations: 4.58 KiB)
julia> @btime equal_str($lst) ;
6.112 μs (8 allocations: 3.08 KiB)
The larger the lst the bigger the difference would be. This applies only to comparing a list with itself, as the OP suggests. To compare two lists, a different algorithm should be employed for O(n*log(n)) time.
Finally, even this algorithm works a little too hard by sorting, but an O(n^2) time/space complexity is inherent in producing the result.
UPDATE:
A more linear O(n) time calculation (still O(n^2) to make matrix):
function equal_str_2(lst)
d = Dict{String,Int}()
d2 = Dict{Int, Vector{Int}}()
for p in pairs(lst)
if haskey(d,p[2])
push!(d2[d[p[2]]],p[1])
else
d[p[2]] = p[1]
d2[p[1]] = [p[1]]
end
end
res = zeros(Bool, (length(lst), length(lst)))
for p in values(d2)
for q in Iterators.product(p,p)
res[q[1],q[2]] = true
res[q[2], q[1]] = true
end
end
return res
end
and benchmark with larger lst:
julia> lst = [randstring('A':'C',3) for i=1:140];
julia> @btime $lst .== permutedims($lst) ;
99.094 μs (5 allocations: 6.89 KiB)
julia> @btime equal_str($lst) ;
51.981 μs (9 allocations: 23.12 KiB)
julia> @btime equal_str_2($lst) ;
21.539 μs (72 allocations: 27.47 KiB)
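A more compact variation on the same idea, as a sketch only (equal_by_group is a hypothetical name, not one of the functions benchmarked above, and I have not timed it): give every distinct string an integer id through a Dict, then broadcast the cheap integer comparison.

```julia
function equal_by_group(lst)
    ids = Dict{eltype(lst),Int}()
    # get! stores the next unused id the first time a value is seen
    # and returns the stored id on every later occurrence
    g = [get!(ids, s, length(ids) + 1) for s in lst]
    return g .== permutedims(g)
end

lst = ["a", "b", "a", "c", "b"]
grouped = equal_by_group(lst)
direct = lst .== permutedims(lst)
```

The matrix itself still costs O(n^2) to fill, but the strings are hashed only once each.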

Related

How to convert string to array with no spaces

Related:
How to convert from string to array?
This is a follow-up question. How would I make a list of all the digits in this number (currently as a string)?
"123" -> [1,2,3]
There are no delimiters here so how should I go about doing this?
Note as of now I am using the latest version of Julia, v1.8.3 so parse doesn't seem to work in the other question's answers. Error when I use parse():
ERROR: LoadError: MethodError: no method matching parse(::SubString{String})
Closest candidates are:
parse(::Type{T}, ::AbstractString) where T<:Complex at parse.jl:381
parse(::Type{Sockets.IPAddr}, ::AbstractString) at ~/usr/share/julia/stdlib/v1.8/Sockets/src/IPAddr.jl:246
parse(::Type{T}, ::AbstractChar; base) where T<:Integer at parse.jl:40
...
Stacktrace:
[1] iterate
@ ./generator.jl:47 [inlined]
[2] _collect
@ ./array.jl:807 [inlined]
[3] collect_similar
@ ./array.jl:716 [inlined]
[4] map
@ ./abstractarray.jl:2933 [inlined]
[5] top-level scope
@ ~/proc/self/fd/0:1
in expression starting at /proc/self/fd/0:1
exit status 1
Easy peasy like this:
function str2vec(s::String)
return map(x->parse(Int,x), split(s,""))
end
julia> str2vec("124")
3-element Vector{Int64}:
1
2
4
Or by broadcasting:
julia> parse.(Int, split("124",""))
3-element Vector{Int64}:
1
2
4
By piping functions:
julia> "124" |> x->split(x, "") |> x->parse.(Int, x)
3-element Vector{Int64}:
1
2
4
Utilizing the eachsplit function, which is a lazy function and returns a generator object (introduced in Julia 1.8):
julia> eachsplit("124", "") |> x->parse.(Int, x)
3-element Vector{Int64}:
1
2
4
Following Dan's advice, you can also try other ways:
Using the Int8 on the collected chars:
julia> Int8.(collect("124")).-48
3-element Vector{Int64}:
1
2
4
Using the Iterators.map:
julia> collect(Iterators.map(x->Int8(x)-48,"124"))
3-element Vector{Int64}:
1
2
4
Also, one can consider the DNF's proposal:
julia> [Int(x)-48 for x in "124"]
3-element Vector{Int64}:
1
2
4
Benchmarking
julia> using BenchmarkTools
julia> @btime str2vec("124");
@btime parse.(Int, split("124",""));
@btime "124" |> x->split(x, "") |> x->parse.(Int, x);
@btime eachsplit("124", "") |> x->parse.(Int, x);
@btime Int8.(collect("124")).-48;
@btime collect(Iterators.map(x->Int8(x)-48,"123"));
@btime [Int(x)-48 for x in "123"]
681.250 ns (11 allocations: 864 bytes)
675.460 ns (11 allocations: 864 bytes)
679.747 ns (11 allocations: 864 bytes)
1.280 μs (14 allocations: 816 bytes)
92.412 ns (2 allocations: 160 bytes)
61.711 ns (1 allocation: 80 bytes)
45.152 ns (1 allocation: 80 bytes)
You can also use the inbuilt digits function.
By default, it returns the digits last-to-first:
julia> digits(parse(Int, "1234"))
4-element Vector{Int64}:
4
3
2
1
You can reverse! the result if you want them in the same order as in the string:
julia> digits(parse(Int, "1234")) |> reverse!
4-element Vector{Int64}:
1
2
3
4
This runs much faster than parsing each digit individually. The Int8(...) .- 48 method is still faster, but it fails silently if the input string happens to be invalid, which could be dangerous further down the line. Since we're using parse here, this method reports the error correctly in such cases.
julia> Int8.(collect("invalid")).-48
7-element Vector{Int64}:
57
62
70
49
60
57
52
julia> digits(parse(Int, "invalid")) |> reverse!
ERROR: ArgumentError: invalid base 10 digit 'i' in "invalid"
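If you want the speed of the character arithmetic without the silent failure, one option is to validate first. This is only a sketch, and digits_or_nothing is a hypothetical helper name. It also shows one caveat of the digits(parse(Int, s)) route: leading zeros are lost, because the string becomes a number first.

```julia
# Hypothetical helper: validate first, then use the fast Char arithmetic.
function digits_or_nothing(s::AbstractString)
    all(isdigit, s) || return nothing      # reject anything that is not 0-9
    return [c - '0' for c in s]
end

ok = digits_or_nothing("124")
bad = digits_or_nothing("invalid")

# Caveat of going through an integer: "007" parses to 7.
lost_zeros = reverse!(digits(parse(Int, "007")))
```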
Both other answers are very good, but they have forgotten about comprehensions. Using a comprehension gives both the fastest safe solution, and the absolute fastest solution, tied with the Iterators.map.
Fastest unsafe (based on the answer by @Shayan with input from @DanGetz):
julia> @btime [Int(c)-48 for c in "123"]
34.372 ns (1 allocation: 80 bytes)
3-element Vector{Int64}:
1
2
3
The above will silently return the wrong answer for invalid inputs, as noted by @SundarR.
Here's an even nicer and more intuitive version of the above, which is the same under the hood:
[c - '0' for c in "123"]
It works because Int('0') equals 48, and subtraction of Chars yields an Int.
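The two facts used here are easy to check for yourself (the variable names are just for illustration):

```julia
code_of_zero = Int('0')               # code point of the character '0'
seven = '7' - '0'                     # Char minus Char gives an Int
ds = [c - '0' for c in "123"]
```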
Fastest safe solution (based on @SundarR's answer):
julia> @btime [parse(Int, c) for c in "123"]
47.822 ns (1 allocation: 80 bytes)
3-element Vector{Int64}:
1
2
3
julia> [parse(Int, c) for c in "invalid"]
ERROR: ArgumentError: invalid base 10 digit 'i'
I would probably recommend the latter in most cases.
One more thing you may or may not be aware of: You can create a generator instead of a vector, in case you don't actually need the vector itself, but want to iterate over the converted numbers for some other purpose. The syntax is almost identical to an array comprehension, just use () instead:
g = (parse(Int, c) for c in "123")
for val in g
println(val, " squared equals ", val^2)
end
1 squared equals 1
2 squared equals 4
3 squared equals 9
This will not allocate an intermediate temporary vector, and creating the generator is essentially free:
julia> @btime (parse(Int, c) for c in "123")
1.900 ns (0 allocations: 0 bytes)
The computational cost is paid during iteration instead. This is similar to using Iterators.map without collect, but arguably has nicer syntax.
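For example, a generator can be passed straight to a reduction, so the digits are never stored in a vector. A minimal sketch with inputs of my own choosing:

```julia
# The parentheses of the function call double as the generator's parentheses.
total = sum(parse(Int, c) for c in "123")
biggest = maximum(parse(Int, c) for c in "907")
```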

How to count matches in two arrays?

If I have two arrays, how can I count the number of matching elements?
E.g. with
x = [1,2,3,4,5]
y = [3,4,5,6]
I'd like to get the count (3) of the three matching elements 3,4,and 5.
You can use intersect:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
3
4
5
6
julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
5
4
3
julia> length(intersect(Set(x), Set(y)))
3
The following algorithm can be nearly 4X faster than Set intersection. The idea is to sort the arrays first, which has O(n log n) complexity for each array, and then merge-compare the sorted versions for equal elements, which has O(m + n) linear complexity. So the overall complexity is O(n log n).
This algorithm counts duplicate elements into the final matches result, but can be modified with a small overhead to behave similarly to sets. The modification can include adding a variable to keep track of the last matched elements and increment the number of matches only for new different matched pairs.
function count_matches(x,y)
sort!(x) # or x = sort(x)
sort!(y) # or y = sort(y)
i = j = 1
matches = 0
while i <= length(x) && j <= length(y)
if x[i] == y[j]
i += 1
j += 1
matches += 1
elseif x[i] < y[j]
i += 1
else
j += 1
end
end
matches
end
Comparing with:
function count_matches0(x,y)
length(intersect(Set(x), Set(y)))
end
and timing with n = 10000 arrays, we get:
@btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
@btime count_matches(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
246.700 μs (31 allocations: 338.31 KiB)
63.200 μs (2 allocations: 15.88 KiB)
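The set-like modification mentioned above could look roughly like this. It is a sketch, not a benchmarked implementation: count_unique_matches is a hypothetical name, and it sorts copies so the caller's arrays are left alone (which costs extra allocations compared to sort!).

```julia
function count_unique_matches(x, y)
    xs, ys = sort(x), sort(y)            # sorted copies, inputs stay untouched
    i = j = 1
    matches = 0
    while i <= length(xs) && j <= length(ys)
        if xs[i] == ys[j]
            v = xs[i]
            matches += 1
            # skip every further copy of the matched value in both arrays
            while i <= length(xs) && xs[i] == v
                i += 1
            end
            while j <= length(ys) && ys[j] == v
                j += 1
            end
        elseif xs[i] < ys[j]
            i += 1
        else
            j += 1
        end
    end
    return matches
end

x = [1, 1, 2, 2, 5]
y = [1, 1, 2, 7]
n_unique = count_unique_matches(x, y)
n_sets = length(intersect(Set(x), Set(y)))
```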
A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers in length, a simple O(N^2) count wins over the count_matches sorting method and the intersect count_matches0 methods above, because of zero allocation setup time:
function count_matches2(x, y)
count(n -> any(==(n), x), y)
end
@btime count_matches(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches2(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)
The simplicity advantage vanishes with arrays of size > 1000.

Skip every nth element of array

How can I remove every nth element from an array in Julia? Let's say I have the following array: a = [1 2 3 4 5 6] and I want b = [1 2 4 5]
In javascript I would do something like:
b = a.filter(e => e % 3);
How can it be done in Julia?
Your question title and text ask different questions. The title asks how to skip the Nth element, whereas the Javascript code snippet details how to skip elements based on their value, not their index.
Skipping by Value
We can do this using filter.
filter((x) -> x % 3 != 0, a)
This is basically equivalent to your Javascript code. We can, incidentally, also use broadcasting.
a[a .% 3 .!= 0]
This is more akin to code you would see in array-oriented languages like MATLAB and R.
Skipping by Index
With an extra enumerate call, we can get the indices to operate on.
map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate(a)))
This is roughly what you'd do in Python. enumerate to get the indices, filter to purge, then map to eliminate the now-unnecessary indices.
Or we can, again, use broadcasting.
a[(1:length(a)) .% 3 .!= 0]
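To see what the index mask looks like, and as one more Base-only option, here is a short sketch. The deleteat! variant is my own addition and works on a copy so that a is not modified. Note also that a = [1 2 3 4 5 6] as written in the question is a 1×6 Matrix, so I flatten it with vec first.

```julia
a = [1, 2, 3, 4, 5, 6]
mask = (1:length(a)) .% 3 .!= 0           # true where the index is not a multiple of 3
b = a[mask]

# Alternative: delete indices 3, 6, ... from a copy of the vector.
c = deleteat!(copy(a), 3:3:length(a))

row = [1 2 3 4 5 6]                       # 1×6 Matrix, as in the question
d = vec(row)[mask]
```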
If you need skipping by index the most elegant way is to use InvertedIndices
julia> using InvertedIndices # or using DataFrames
julia> a[Not(3:3:end)]
4-element Vector{Int64}:
1
2
4
5
As you can see all your job here is to provide a range of indices you wish to skip.
If you want to filter by the index, one convenient way is using a comprehension:
julia> a = 10:10:100;
julia> [a[i] for i in eachindex(a) if i % 3 != 0] |> permutedims
1×7 Matrix{Int64}:
10 20 40 50 70 80 100
julia> vec(ans) == [a[1 + 3(j-1)÷2] for j in 1:7]
true
This implicitly involves Iterators.filter, and collects the generator. You can also use this to filter by value, although the eager filter is probably more efficient:
julia> a = 1:10;
julia> [x for x in a if x%3!=0] |> permutedims
1×7 Matrix{Int64}:
1 2 4 5 7 8 10
Perhaps it's interesting to time all of these:
julia> using BenchmarkTools, InvertedIndices
julia> a = rand(1000); # filter by index
julia> i1 = @btime [$a[1 + 3(j-1)÷2] for j in 1:667];
373.162 ns (1 allocation: 5.38 KiB)
julia> i2 = @btime $a[eachindex($a) .% 3 .!= 0];
1.387 μs (4 allocations: 9.80 KiB)
julia> i3 = @btime [$a[i] for i in eachindex($a) if i % 3 != 0];
3.557 μs (11 allocations: 16.47 KiB)
julia> i4 = @btime map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate($a)));
4.202 μs (11 allocations: 16.47 KiB)
julia> i5 = @btime $a[Not(3:3:end)];
84.333 μs (4655 allocations: 182.28 KiB)
julia> i1 == i2 == i3 == i4 == i5
true
julia> a = rand(1:99, 1000); # filter by value
julia> v1 = @btime filter(x -> x%3!=0, $a);
532.185 ns (1 allocation: 7.94 KiB)
julia> v2 = @btime [x for x in $a if x%3!=0];
5.465 μs (11 allocations: 16.47 KiB)
julia> v1 == v2
true
This should help you:
b = a[Bool[i %3 != 0 for i = 1:length(a)]]
a[a .% 2 .!= 0]
please find the link with code.

Julia: A fast and elegant way to get a matrix from an array of arrays

There is an array of arrays containing more than 10,000 pairs of Float64 values. Something like this:
v = [[rand(),rand()], ..., [rand(),rand()]]
I want to get a matrix with two columns from it. It is possible to go through all the pairs with a loop; it looks cumbersome, but gives the result in a fraction of a second:
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
w = hcat(x,y)
The solution with permutedims(reshape(hcat(v...), (length(v[1]), length(v)))), which I found in this task, looks more elegant but hangs Julia completely, so that the session has to be restarted. Perhaps it was optimal six years ago, but now it does not work for large arrays. Is there a solution that is both compact and fast?
I hope this is short and efficient enough for you:
getindex.(v, [1 2])
and if you want something simpler to digest:
[v[i][j] for i in 1:length(v), j in 1:2]
Also the hcat solution could be written as:
permutedims(reshape(reduce(hcat, v), (length(v[1]), length(v))));
and it should not hang your Julia (please confirm - it works for me).
@Antonello: to understand why this works consider a simpler example:
julia> string.(["a", "b", "c"], [1 2])
3×2 Matrix{String}:
"a1" "a2"
"b1" "b2"
"c1" "c2"
I am broadcasting a column Vector ["a", "b", "c"] and a 1-row Matrix [1 2]. The point is that [1 2] is a Matrix. Thus it makes broadcasting to expand both rows (forced by the vector) and columns (forced by a Matrix). For such expansion to happen it is crucial that the [1 2] matrix has exactly one row. Is this clearer now?
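The same mechanism on a tiny version of the original problem, as a sketch with three made-up pairs. It also suggests that the reshape in the hcat variant above is not strictly needed, since reduce(hcat, v) already has one column per pair:

```julia
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Rows come from v (3 elements), columns from the 1×2 index matrix [1 2].
M = getindex.(v, [1 2])

cols = reduce(hcat, v)        # 2×3: each inner vector becomes a column
M_alt = permutedims(cols)     # 3×2: same result as M
```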
Your own example is pretty close to a good solution, but does some unnecessary work, by creating two distinct vectors, and repeatedly using push!. This solution is similar, but simpler. It is not as terse as the broadcasted getindex by @BogumilKaminski, but is faster:
function mat(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for i in eachindex(v)
M[i, 1] = v[i][1]
M[i, 2] = v[i][2]
end
return M
end
You can simplify it a bit further, without losing performance, like this:
function mat_simpler(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for (i, x) in pairs(v)
M[i, 1], M[i, 2] = x
end
return M
end
A benchmark of the various solutions posted so far...
using BenchmarkTools
# Creating the vector
v = [[i, i+0.1] for i in 0.1:0.2:2000]
M1 = @btime vcat([[e[1] e[2]] for e in $v]...)
M2 = @btime getindex.($v, [1 2])
M3 = @btime [v[i][j] for i in 1:length($v), j in 1:2]
M4 = @btime permutedims(reshape(reduce(hcat, $v), (length($v[1]), length($v))))
M5 = @btime permutedims(reshape(hcat($v...), (length($v[1]), length($v))))
function original(v)
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
return hcat(x,y)
end
function mat(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for i in eachindex(v)
M[i, 1] = v[i][1]
M[i, 2] = v[i][2]
end
return M
end
function mat_simpler(v)
M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
for (i, x) in pairs(v)
M[i, 1], M[i, 2] = x
end
return M
end
M6 = @btime original($v)
M7 = @btime mat($v)
M8 = @btime mat_simpler($v)
M1 == M2 == M3 == M4 == M5 == M6 == M7 == M8 # true
Output:
1.126 ms (10010 allocations: 1.53 MiB) # M1
54.161 μs (3 allocations: 156.42 KiB) # M2
809.000 μs (38983 allocations: 765.50 KiB) # M3
98.935 μs (4 allocations: 312.66 KiB) # M4
244.696 μs (10 allocations: 469.23 KiB) # M5
219.907 μs (30 allocations: 669.61 KiB) # M6
34.311 μs (2 allocations: 156.33 KiB) # M7
34.395 μs (2 allocations: 156.33 KiB) # M8
Note that the dollar sign in the benchmarked code is just to force @btime to consider the vector as a local variable.

MATLAB-style replacement of array values that meet certain condition in Julia [duplicate]

In Octave, I can do
octave:1> A = [1 2; 3 4]
A =
1 2
3 4
octave:2> A(A>1) -= 1
A =
1 1
2 3
but in Julia, the equivalent syntax does not work.
julia> A = [1 2; 3 4]
2x2 Array{Int64,2}:
1 2
3 4
julia> A[A>1] -= 1
ERROR: `isless` has no method matching isless(::Int64, ::Array{Int64,2})
in > at operators.jl:33
How do you conditionally assign values to certain array or matrix elements in Julia?
Your problem isn't with the assignment, per se, it's that A > 1 itself doesn't work. You can use the elementwise A .> 1 instead:
julia> A = [1 2; 3 4];
julia> A .> 1
2×2 BitArray{2}:
false true
true true
julia> A[A .> 1] .-= 1000;
julia> A
2×2 Array{Int64,2}:
1 -998
-997 -996
Update:
Note that in modern Julia (>= 0.7), we need to use . to say that we want to broadcast the action (here, subtracting by the scalar 1000) to match the size of the filtered target on the left. (At the time this question was originally asked, we needed the dot in A .> 1 but not in .-=.)
In Julia v1.0 you can use the replace! function instead of logical indexing, with considerable speedups:
julia> B = rand(0:20, 8, 2);
julia> @btime (A[A .> 10] .= 10) setup=(A=copy($B))
595.784 ns (11 allocations: 4.61 KiB)
julia> @btime replace!(x -> x>10 ? 10 : x, A) setup=(A=copy($B))
13.530 ns (0 allocations: 0 bytes)
For larger matrices, the difference hovers around 10x speedup.
The reason for the speedup is that the logical indexing solution relies on creating an intermediate array, while replace! avoids this.
A slightly terser way of writing it is
replace!(x -> min(x, 10), A)
There doesn't seem to be any speedup using min, though.
And here's another solution that is almost as fast:
A .= min.(A, 10)
and that also avoids allocations.
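Putting the variants from this thread side by side on small matrices, as a sketch with my own example values. The first part reproduces the Octave example from the question, the second checks that the three ways of clamping agree:

```julia
# Octave's A(A>1) -= 1, written with dotted operators
A = [1 2; 3 4]
A[A .> 1] .-= 1

B = [5 12; 20 3]

B1 = copy(B)
B1[B1 .> 10] .= 10                         # logical indexing

B2 = replace!(x -> min(x, 10), copy(B))    # replace! with a function

B3 = copy(B)
B3 .= min.(B3, 10)                         # in-place broadcast
```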
To make it work in Julia 1.0 one needs to change = to .=. In other words:
julia> a = [1 2 3 4]
julia> a[a .> 1] .= 1
julia> a
1×4 Array{Int64,2}:
1 1 1 1
Otherwise you will get something like
ERROR: MethodError: no method matching setindex_shape_check(::Int64, ::Int64)

Resources