How to count matches in two arrays?

If I have two arrays, how can I count the number of matching elements?
E.g. with
x = [1,2,3,4,5]
y = [3,4,5,6]
I'd like to get the count (3) of the three matching elements 3, 4, and 5.

You can use intersect:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
3
4
5
6
julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
5
4
3
julia> length(intersect(Set(x), Set(y)))
3

The following algorithm can be nearly 4x faster than Set intersection. The idea is to sort the arrays first, which has O(n log n) complexity for each array, and then merge-compare the sorted versions for equal elements, which has O(m + n) linear complexity. The overall complexity is therefore O(n log n).
As written, this algorithm counts duplicate elements in the final result, but it can be modified with a small overhead to behave like sets: add a variable that keeps track of the last matched element, and increment the number of matches only when a newly matched pair differs from it.
function count_matches(x, y)
    sort!(x) # or x = sort(x)
    sort!(y) # or y = sort(y)
    i = j = 1
    matches = 0
    while i <= length(x) && j <= length(y)
        if x[i] == y[j]
            i += 1
            j += 1
            matches += 1
        elseif x[i] < y[j]
            i += 1
        else
            j += 1
        end
    end
    matches
end
Comparing with:
function count_matches0(x,y)
length(intersect(Set(x), Set(y)))
end
and timing with n = 10000 arrays, we get:
@btime count_matches(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
63.200 μs (2 allocations: 15.88 KiB)
@btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
246.700 μs (31 allocations: 338.31 KiB)
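The set-like modification described above could be sketched as follows. This is written in Python rather than Julia, purely as an illustration; the function name count_distinct_matches and the last_matched / have_last variables are hypothetical and not part of the original answer.

```python
def count_distinct_matches(x, y):
    """Count distinct values present in both sequences via sort + merge."""
    xs, ys = sorted(x), sorted(y)  # sorted copies; the inputs are left untouched
    i = j = 0
    matches = 0
    last_matched = None  # value of the most recently counted match
    have_last = False    # becomes True once any match has been counted
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            # Count this pair only if it is a new value, not a repeat.
            if not have_last or xs[i] != last_matched:
                matches += 1
                last_matched = xs[i]
                have_last = True
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return matches


print(count_distinct_matches([1, 2, 3, 4, 5], [3, 4, 5, 6]))  # 3
print(count_distinct_matches([3, 3, 3], [3, 3]))              # 1
```

With the extra check, repeated values contribute one match at most, so the result agrees with the size of the set intersection.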

A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers long, a simple O(N^2) count beats both the sorting method count_matches and the intersect method count_matches0 above, because it has no allocation or setup overhead:
function count_matches2(x, y)
count(n -> any(==(n), x), y)
end
@btime count_matches(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches2(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)
The simplicity advantage vanishes with arrays of size > 1000.

Related

Julia pairwise broadcast

I would like to compare every pair of strings in a list of strings in Julia. One way to do it is
equal_strs = [(x == y) for x in str_list, y in str_list]
However, if I use broadcast as follows:
equal_strs = broadcast(==, str_list, str_list)
it returns a vector instead of a 2D array. Is there a way to output a 2D array using broadcast?
Broadcasting works by expanding ("broadcasting") dimensions that do not have the same length, so that an array of size (for example) Nx1xM broadcast with an NxKx1 array gives an NxKxM array.
This means that if you broadcast an operation over two length-N vectors, you get a length-N vector.
So you need one string array to be a length-N vector, and the other a 1xM matrix:
julia> using Random
julia> str1 = [randstring('A':'C', 3) for _ in 1:5]
5-element Vector{String}:
"ACC"
"CBC"
"AAC"
"CAB"
"BAB"
julia> str2 = [randstring('A':'C', 3) for _ in 1:4]
4-element Vector{String}:
"ABB"
"BAB"
"CAA"
"BBC"
julia> str1 .== permutedims(str2)
5×4 BitMatrix:
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 0 0
permutedims will change a length N vector into a 1xN matrix.
BTW, you would very rarely use broadcast in your code (broadcast(==, a, b)), instead, use the dot syntax, a .== b, which is more idiomatic.
You should have one vector transposed for the broadcasting machinery to build a matrix by expanding dimensions of the inputs to agree.
julia> str_list = ["hello", "car", "me", "good", "people", "good"];
julia> equal_strs = broadcast(==, str_list, permutedims(str_list))
6×6 BitMatrix:
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 1
0 0 0 0 1 0
0 0 0 1 0 1
Also, the following are similar.
equal_strs = str_list .== permutedims(str_list)
equal_strs = isequal.(str_list, permutedims(str_list))
I will assume that by "list" you mean a Vector, as there are no Python-like lists in Julia. If you meant a tuple, I would suggest converting it into a Vector anyway, because broadcasting is best used with Arrays (of which Vector is a subtype).
str_list = ["one", "two", "three", "one", "two"]
Now you simply do
broadcast(==, str_list, permutedims(str_list))
or more concise with the dot operator
str_list .== permutedims(str_list)
What happens under the hood:
Broadcasting in Julia works element-wise, so if you have two Vectors of the same length there is nothing to expand, because the dimensions already match, and you get a Vector back.
But if you have a Vector and a one-row Matrix (a Vector is a 1D Array and a Matrix is a 2D Array), their shapes are treated as (N,1) and (1,N), and Julia broadcasts the length-1 dimensions, giving you a Matrix of shape (N,N), which is what you want.
Now usually with numbers you would do ' instead of permutedims
num_list .== num_list'
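As to why this doesn't work with strings, see this answer (in short, ' is the adjoint, which is applied recursively to the elements and is not defined for strings).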
lst .== permutedims(lst) is a perfectly good method to find the result, as suggested by other answers. But it takes O(n^2) comparisons, and if the list is long, it might be better to use an algorithm that needs only O(n*log(n)) comparisons. The following is an implementation of such an algorithm, with a small benchmark:
function equal_str(lst)
    sp = sortperm(lst)
    isp = invperm(sp)
    same = [i==1 ? false : lst[sp[i]]==lst[sp[i-1]] for i=1:length(lst)]
    ac = accumulate((k,v)-> ifelse(v==false, k+1, k), same; init=0)
    return [ ac[isp[i]]==ac[isp[j]] for i=1:length(lst),j=1:length(lst) ]
end
and the benchmark gives:
julia> using Random
julia> using BenchmarkTools
julia> lst = [randstring('A':'C',3) for i=1:40];
julia> show(lst)
["CBA", "CAB", "BCA", "AAC", "AAA", "ABC", "BBA", "CAB", "CBC", "CCA",
"BCC", "BCB", "CAB", "BCB", "ACC", "CBC", "CCC", "CCB", "BCB", "BCB",
"ABA", "AAC", "CCC", "ABC", "BAC", "CAB", "BAB", "BCB", "CCA", "CAC",
"AAA", "BBC", "ABC", "BCB", "CBA", "CAA", "CAB", "CAC", "CBC", "CBC"]
julia> @btime $lst .== permutedims($lst) ;
9.025 μs (5 allocations: 4.58 KiB)
julia> @btime equal_str($lst) ;
6.112 μs (8 allocations: 3.08 KiB)
The larger the lst the bigger the difference would be. This applies only to comparing a list with itself, as the OP suggests. To compare two lists, a different algorithm should be employed for O(n*log(n)) time.
Finally, even this algorithm works a little too hard by sorting, but an O(n^2) time/space complexity is inherent in producing the result.
UPDATE:
A linear, O(n) expected time, grouping step (building the matrix is still O(n^2)):
function equal_str_2(lst)
    d = Dict{String,Int}()          # string => index of its first occurrence
    d2 = Dict{Int, Vector{Int}}()   # first index => all indices holding that string
    for p in pairs(lst)
        if haskey(d,p[2])
            push!(d2[d[p[2]]],p[1])
        else
            d[p[2]] = p[1]
            d2[p[1]] = [p[1]]
        end
    end
    res = zeros(Bool, (length(lst), length(lst)))
    for p in values(d2)
        for q in Iterators.product(p,p)
            res[q[1],q[2]] = true
            res[q[2], q[1]] = true
        end
    end
    return res
end
and benchmark with larger lst:
julia> lst = [randstring('A':'C',3) for i=1:140];
julia> @btime $lst .== permutedims($lst) ;
99.094 μs (5 allocations: 6.89 KiB)
julia> @btime equal_str($lst) ;
51.981 μs (9 allocations: 23.12 KiB)
julia> @btime equal_str_2($lst) ;
21.539 μs (72 allocations: 27.47 KiB)

Skip every nth element of array

How can I remove every nth element from an array in Julia? Let's say I have the following array: a = [1 2 3 4 5 6] and I want b = [1 2 4 5]
In JavaScript I would do something like:
b = a.filter(e => e % 3);
How can it be done in Julia?
Your question title and text ask different questions. The title asks how to skip every Nth element, whereas the JavaScript code snippet skips elements based on their value, not their index.
Skipping by Value
We can do this using filter.
filter((x) -> x % 3 != 0, a)
This is basically equivalent to your Javascript code. We can, incidentally, also use broadcasting.
a[a .% 3 .!= 0]
This is more akin to code you would see in array-oriented languages like MATLAB and R.
Skipping by Index
With an extra enumerate call, we can get the indices to operate on.
map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate(a)))
This is roughly what you'd do in Python. enumerate to get the indices, filter to purge, then map to eliminate the now-unnecessary indices.
Or we can, again, use broadcasting.
a[(1:length(a)) .% 3 .!= 0]
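For comparison, the Python counterpart alluded to above might look like the sketch below. It is only an illustration of the index-based approach; the variable names are arbitrary, and positions are counted from 1 to match the Julia examples.

```python
a = [1, 2, 3, 4, 5, 6]

# Keep elements whose 1-based position is not a multiple of 3.
b = [x for i, x in enumerate(a, start=1) if i % 3 != 0]
print(b)  # [1, 2, 4, 5]

# Same result by deleting every third element from a copy via an extended slice.
c = list(a)
del c[2::3]
print(c)  # [1, 2, 4, 5]
```

The comprehension fuses the enumerate, filter, and map steps into a single expression.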
If you need to skip by index, the most elegant way is to use InvertedIndices:
julia> using InvertedIndices # or using DataFrames
julia> a[Not(3:3:end)]
4-element Vector{Int64}:
1
2
4
5
As you can see, all you have to do is provide a range of the indices you wish to skip.
If you want to filter by the index, one convenient way is using a comprehension:
julia> a = 10:10:100;
julia> [a[i] for i in eachindex(a) if i % 3 != 0] |> permutedims
1×7 Matrix{Int64}:
10 20 40 50 70 80 100
julia> vec(ans) == [a[1 + 3(j-1)÷2] for j in 1:7]
true
This implicitly involves Iterators.filter, and collects the generator. You can also use this to filter by value, although the eager filter is probably more efficient:
julia> a = 1:10;
julia> [x for x in a if x%3!=0] |> permutedims
1×7 Matrix{Int64}:
1 2 4 5 7 8 10
Perhaps it's interesting to time all of these:
julia> using BenchmarkTools, InvertedIndices
julia> a = rand(1000); # filter by index
julia> i1 = @btime [$a[1 + 3(j-1)÷2] for j in 1:667];
373.162 ns (1 allocation: 5.38 KiB)
julia> i2 = @btime $a[eachindex($a) .% 3 .!= 0];
1.387 μs (4 allocations: 9.80 KiB)
julia> i3 = @btime [$a[i] for i in eachindex($a) if i % 3 != 0];
3.557 μs (11 allocations: 16.47 KiB)
julia> i4 = @btime map((x) -> x[2], Iterators.filter(((x) -> x[1] % 3 != 0), enumerate($a)));
4.202 μs (11 allocations: 16.47 KiB)
julia> i5 = @btime $a[Not(3:3:end)];
84.333 μs (4655 allocations: 182.28 KiB)
julia> i1 == i2 == i3 == i4 == i5
true
julia> a = rand(1:99, 1000); # filter by value
julia> v1 = @btime filter(x -> x%3!=0, $a);
532.185 ns (1 allocation: 7.94 KiB)
julia> v2 = @btime [x for x in $a if x%3!=0];
5.465 μs (11 allocations: 16.47 KiB)
julia> v1 == v2
true
This should help you:
b = a[Bool[i %3 != 0 for i = 1:length(a)]]
a[a .% 2 .!= 0]
please find the link with code.

Julia: A fast and elegant way to get a matrix from an array of arrays

There is an array of arrays containing more than 10,000 pairs of Float64 values. Something like this:
v = [[rand(),rand()], ..., [rand(),rand()]]
I want to get a two-column matrix from it. It is possible to go through all the pairs with a loop; it looks cumbersome, but gives the result in a fraction of a second:
x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
push!(x, v[i][1])
push!(y, v[i][2])
end
w = hcat(x,y)
The solution with permutedims(reshape(hcat(v...), (length(v[1]), length(v)))), which I found in this task, looks more elegant but hangs Julia completely, so the session has to be restarted. Perhaps it was optimal six years ago, but it no longer works for large arrays. Is there a solution that is both compact and fast?
I hope this is short and efficient enough for you:
getindex.(v, [1 2])
and if you want something simpler to digest:
[v[i][j] for i in 1:length(v), j in 1:2]
Also the hcat solution could be written as:
permutedims(reshape(reduce(hcat, v), (length(v[1]), length(v))));
and it should not hang your Julia (please confirm - it works for me).
@Antonello: to understand why this works consider a simpler example:
julia> string.(["a", "b", "c"], [1 2])
3×2 Matrix{String}:
"a1" "a2"
"b1" "b2"
"c1" "c2"
I am broadcasting a column Vector ["a", "b", "c"] and a 1-row Matrix [1 2]. The point is that [1 2] is a Matrix. This makes broadcasting expand both rows (forced by the Vector) and columns (forced by the Matrix). For this expansion to happen it is crucial that the [1 2] matrix has exactly one row. Is this clearer now?
Your own example is pretty close to a good solution, but does some unnecessary work by creating two distinct vectors and repeatedly using push!. This solution is similar, but simpler. It is not as terse as the broadcasted getindex by @BogumilKaminski, but is faster:
function mat(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for i in eachindex(v)
        M[i, 1] = v[i][1]
        M[i, 2] = v[i][2]
    end
    return M
end
You can simplify it a bit further, without losing performance, like this:
function mat_simpler(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for (i, x) in pairs(v)
        M[i, 1], M[i, 2] = x
    end
    return M
end
A benchmark of the various solutions posted so far...
using BenchmarkTools
# Creating the vector
v = [[i, i+0.1] for i in 0.1:0.2:2000]
M1 = @btime vcat([[e[1] e[2]] for e in $v]...)
M2 = @btime getindex.($v, [1 2])
M3 = @btime [v[i][j] for i in 1:length($v), j in 1:2] # v[i][j] is not interpolated here, so this timing includes global-variable access
M4 = @btime permutedims(reshape(reduce(hcat, $v), (length($v[1]), length($v))))
M5 = @btime permutedims(reshape(hcat($v...), (length($v[1]), length($v))))
function original(v)
    x = Vector{Float64}()
    y = Vector{Float64}()
    for i = 1:length(v)
        push!(x, v[i][1])
        push!(y, v[i][2])
    end
    return hcat(x,y)
end
function mat(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for i in eachindex(v)
        M[i, 1] = v[i][1]
        M[i, 2] = v[i][2]
    end
    return M
end
function mat_simpler(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for (i, x) in pairs(v)
        M[i, 1], M[i, 2] = x
    end
    return M
end
M6 = @btime original($v)
M7 = @btime mat($v)
M8 = @btime mat_simpler($v)
M1 == M2 == M3 == M4 == M5 == M6 == M7 == M8 # true
Output:
1.126 ms (10010 allocations: 1.53 MiB) # M1
54.161 μs (3 allocations: 156.42 KiB) # M2
809.000 μs (38983 allocations: 765.50 KiB) # M3
98.935 μs (4 allocations: 312.66 KiB) # M4
244.696 μs (10 allocations: 469.23 KiB) # M5
219.907 μs (30 allocations: 669.61 KiB) # M6
34.311 μs (2 allocations: 156.33 KiB) # M7
34.395 μs (2 allocations: 156.33 KiB) # M8
Note that the dollar sign in the benchmarked code is just to force @btime to consider the vector as a local variable.

Total numbers having frequency k in a given range

How can I find how many numbers have frequency = k within a particular range (l, r) of a given array? There are 10^5 queries in total, each of the form l, r, and each query is built on the previous query's answer. In particular, after each query we increment l by the result of the query, swapping l and r if l > r. Note that 0 <= a[i] <= 10^9. The array has n = 10^5 elements.
My Attempt:
n,k,q = map(int,input().split())
a = list(map(int,input().split()))
ans = 0
for _ in range(q):
    l,r = map(int,input().split())
    l+=ans
    l%=n
    r+=ans
    r%=n
    if l>r:
        l,r = r,l
    d = {}
    for i in a[l:r+1]:
        try:
            d[i]+=1
        except:
            d[i] = 1
    curr_ans = 0
    for i in d.keys():
        if d[i]==k:
            curr_ans+=1
    ans = curr_ans
    print(ans)
Sample Input:
5 2 3
7 6 6 5 5
0 4
3 0
4 1
Sample Output:
2
1
1
If the number of different values in the array is not too large, you may consider storing prefix counts: one array per unique value, as long as the input array, holding the number of appearances of that value up to each position. Then you just subtract the counts at the start of the range from the counts at its end, and count how many values have a difference equal to k:
def range_freq_queries(seq, k, queries):
    n = len(seq)
    c = freq_counts(seq)
    result = [0] * len(queries)
    offset = 0
    for i, (l, r) in enumerate(queries):
        result[i] = range_freq_matches(c, offset, l, r, k, n)
        offset = result[i]
    return result
def freq_counts(seq):
    # counts[i][j] = occurrences of the j-th unique value among the first i elements
    s = {v: i for i, v in enumerate(set(seq))}
    counts = [None] * (len(seq) + 1)
    counts[0] = [0] * len(s)
    for i, v in enumerate(seq, 1):
        counts[i] = list(counts[i - 1])
        j = s[v]
        counts[i][j] += 1
    return counts
def range_freq_matches(counts, offset, start, end, k, n):
    start, end = sorted(((start + offset) % n, (end + offset) % n))
    return sum(1 for cs, ce in zip(counts[start], counts[end + 1]) if ce - cs == k)
seq = [7, 6, 6, 5, 5]
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries(seq, k, queries))
# [2, 1, 1]
You can do it faster with NumPy, too. Since each result depends on the previous one, you will have to loop in any case, but you can use Numba to really speed things up:
import numpy as np
import numba as nb
def range_freq_queries_np(seq, k, queries):
    seq = np.asarray(seq)
    c = freq_counts_np(seq)
    return _range_freq_queries_np_nb(seq, k, queries, c)
@nb.njit  # This is not necessary but will make things faster
def _range_freq_queries_np_nb(seq, k, queries, c):
    n = len(seq)
    offset = np.int32(0)
    out = np.empty(len(queries), dtype=np.int32)
    for i, (l, r) in enumerate(queries):
        l = (l + offset) % n
        r = (r + offset) % n
        l, r = min(l, r), max(l, r)
        out[i] = np.sum(c[r + 1] - c[l] == k)
        offset = out[i]
    return out
def freq_counts_np(seq):
    uniq = np.unique(seq)
    # Prepend a value that matches nothing, so row 0 of the table is all zeros
    seq_pad = np.concatenate([[uniq.max() + 1], seq])
    comp = seq_pad[:, np.newaxis] == uniq
    return np.cumsum(comp, axis=0)
seq = np.array([7, 6, 6, 5, 5])
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries_np(seq, k, queries))
# [2 1 1]
Let's compare it with the original algorithm:
from collections import Counter
def range_freq_queries_orig(seq, k, queries):
    n = len(seq)
    ans = 0
    counter = Counter()
    out = [0] * len(queries)
    for i, (l, r) in enumerate(queries):
        l += ans
        l %= n
        r += ans
        r %= n
        if l > r:
            l, r = r, l
        counter.clear()
        counter.update(seq[l:r+1])
        ans = sum(1 for v in counter.values() if v == k)
        out[i] = ans
    return out
Here is a quick test and timing:
import random
import numpy as np
# Make random input
random.seed(0)
seq = random.choices(range(1000), k=5000)
queries = [(random.choice(range(len(seq))), random.choice(range(len(seq))))
           for _ in range(20000)]
k = 20
# Input as array for NumPy version
seq_arr = np.asarray(seq)
# Check all functions return the same result
res1 = range_freq_queries_orig(seq, k, queries)
res2 = range_freq_queries(seq, k, queries)
print(all(r1 == r2 for r1, r2 in zip(res1, res2)))
# True
res3 = range_freq_queries_np(seq_arr, k, queries)
print(all(r1 == r3 for r1, r3 in zip(res1, res3)))
# True
# Timings
%timeit range_freq_queries_orig(seq, k, queries)
# 3.07 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries(seq, k, queries)
# 1.1 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries_np(seq_arr, k, queries)
# 265 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Obviously, the effectiveness of this depends on the characteristics of the data. In particular, if there are few repeated values, the time and memory cost of constructing the counts table approaches O(n^2).
Let's say the input array is A, with |A| = n. I'm going to assume that the number of distinct elements in A is much smaller than n.
We can divide A into sqrt(n) segments, each of size sqrt(n). For each of these segments, we can calculate a map from element to count. Building these maps takes O(n) time.
With that preprocessing done, we can answer each query by adding together all the maps wholly contained in (l, r), of which there are at most sqrt(n), then adding any extra elements at the two ends (or going one segment over and subtracting), of which there are also O(sqrt(n)).
If there are d distinct elements, each query takes O(sqrt(n) * d), which is O(n) in the worst case where every element of A is distinct, since each segment map then holds at most sqrt(n) entries.
You can keep track of the elements that have the desired count while combining the maps and the extra elements.
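The segment idea above could be sketched roughly as follows. This is a minimal illustration, not a tuned solution: the names build_blocks and count_with_freq are hypothetical, the range is taken as inclusive on both ends, and the shifting of l and r by the previous answer is left to the caller.

```python
from collections import Counter
from math import isqrt


def build_blocks(seq):
    """Split seq into blocks of about sqrt(n) items and count each block."""
    size = max(1, isqrt(len(seq)))
    blocks = [Counter(seq[s:s + size]) for s in range(0, len(seq), size)]
    return size, blocks


def count_with_freq(seq, size, blocks, l, r, k):
    """Number of distinct values occurring exactly k times in seq[l..r]."""
    total = Counter()
    i = l
    while i <= r:
        if i % size == 0 and i + size - 1 <= r:
            total.update(blocks[i // size])  # whole block lies inside the range
            i += size
        else:
            total[seq[i]] += 1               # leftover element at either end
            i += 1
    return sum(1 for c in total.values() if c == k)


seq = [7, 6, 6, 5, 5]
size, blocks = build_blocks(seq)
print(count_with_freq(seq, size, blocks, 0, 4, 2))  # 2
print(count_with_freq(seq, size, blocks, 0, 2, 2))  # 1
```

Each query merges at most about sqrt(n) block maps and visits O(sqrt(n)) leftover elements, so the per-query cost is governed by the size of the block maps rather than by the length of the range.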

MATLAB-style replacement of array values that meet certain condition in Julia [duplicate]

In Octave, I can do
octave:1> A = [1 2; 3 4]
A =
1 2
3 4
octave:2> A(A>1) -= 1
A =
1 1
2 3
but in Julia, the equivalent syntax does not work.
julia> A = [1 2; 3 4]
2x2 Array{Int64,2}:
1 2
3 4
julia> A[A>1] -= 1
ERROR: `isless` has no method matching isless(::Int64, ::Array{Int64,2})
in > at operators.jl:33
How do you conditionally assign values to certain array or matrix elements in Julia?
Your problem isn't with the assignment, per se, it's that A > 1 itself doesn't work. You can use the elementwise A .> 1 instead:
julia> A = [1 2; 3 4];
julia> A .> 1
2×2 BitArray{2}:
false true
true true
julia> A[A .> 1] .-= 1000;
julia> A
2×2 Array{Int64,2}:
1 -998
-997 -996
Update:
Note that in modern Julia (>= 0.7), we need to use . to say that we want to broadcast the action (here, subtracting by the scalar 1000) to match the size of the filtered target on the left. (At the time this question was originally asked, we needed the dot in A .> 1 but not in .-=.)
In Julia v1.0 you can use the replace! function instead of logical indexing, with considerable speedups:
julia> B = rand(0:20, 8, 2);
julia> @btime (A[A .> 10] .= 10) setup=(A=copy($B))
595.784 ns (11 allocations: 4.61 KiB)
julia> @btime replace!(x -> x>10 ? 10 : x, A) setup=(A=copy($B))
13.530 ns (0 allocations: 0 bytes)
For larger matrices, the difference hovers around 10x speedup.
The reason for the speedup is that the logical indexing solution relies on creating an intermediate array, while replace! avoids this.
A slightly terser way of writing it is
replace!(x -> min(x, 10), A)
There doesn't seem to be any speedup using min, though.
And here's another solution that is almost as fast:
A .= min.(A, 10)
and that also avoids allocations.
To make it work in Julia 1.0, one needs to change = to .=. In other words:
julia> a = [1 2 3 4]
julia> a[a .> 1] .= 1
julia> a
1×4 Array{Int64,2}:
1 1 1 1
Otherwise you will get something like
ERROR: MethodError: no method matching setindex_shape_check(::Int64, ::Int64)

Resources