Comparing two arrays in ClickHouse rows - arrays

Are there any options for comparing two arrays in ClickHouse?
There are two columns, colA and colB, each containing an array.
Is there any algorithm that compares the arrays in colA and colB for each row of a ClickHouse table and sets colC to 1 if the arrays are equal and 0 if they are not?
For example:
colA | colB | colC
---------------------------------|----------------------------------|-----
{555,571,701,707,741,1470,4965} | {555,571,701,707,741,1470,4965} |1
{555,571,701,707,741,1470,4965} | {555,571,701,707,741,1470,4964} |0

I asked the same question in the ClickHouse Google Group and got this answer from Denis Zhuravlev:
In the latest version of ClickHouse, 18.1.0 (2018-07-23, #2026), arrays can be compared directly:
select [111,222] A, [111,222] B, [111,333] C, A=B ab, A=C ac
results in
┌─A─────────┬─B─────────┬─C─────────┬─ab─┬─ac─┐
│ [111,222] │ [111,222] │ [111,333] │ 1 │ 0 │
└───────────┴───────────┴───────────┴────┴────┘
Before 18.1.0 you can use lambdas, or a workaround such as ARRAY JOIN combined with groupArray:
SELECT
NOT has(groupArray(A = B), 0) ab
,NOT has(groupArray(A = C), 0) ac
FROM
(
SELECT
[111,222] A
,[111,222] B
,[111,333] C
)
ARRAY JOIN
A
,B
,C
┌─ab─┬─ac─┐
│ 1 │ 0 │
└────┴────┘

I think = (equals) works on arrays now. Tested on 20.3.5.21; note that the comparison is order-sensitive:
Cloud10 :) SELECT [2,1] = [1,2]
SELECT [2, 1] = [1, 2]
┌─equals([2, 1], [1, 2])─┐
│ 0 │
└────────────────────────┘
1 rows in set. Elapsed: 0.003 sec.
Cloud10 :) SELECT [2,1] = [2,1]
SELECT [2, 1] = [2, 1]
┌─equals([2, 1], [2, 1])─┐
│ 1 │
└────────────────────────┘
1 rows in set. Elapsed: 0.003 sec.
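To see the same row-wise semantics outside ClickHouse, here is a minimal Python sketch (not ClickHouse code). The names rows and with_equality_flag are hypothetical and made up for illustration; the data is taken from the example table in the question. Python list equality, like the = operator shown above, compares element by element and is order-sensitive.

```python
# Hypothetical illustration in Python; this is not ClickHouse code.
rows = [
    ([555, 571, 701, 707, 741, 1470, 4965], [555, 571, 701, 707, 741, 1470, 4965]),
    ([555, 571, 701, 707, 741, 1470, 4965], [555, 571, 701, 707, 741, 1470, 4964]),
]

def with_equality_flag(rows):
    """Return (colA, colB, colC) tuples where colC is 1 if the arrays are equal, else 0."""
    return [(a, b, int(a == b)) for a, b in rows]

flags = [c for _, _, c in with_equality_flag(rows)]
print(flags)              # [1, 0]
print([2, 1] == [1, 2])   # False: same elements, different order
```

If order should not matter, comparing sorted copies of both arrays would be one option, but that is a different question from the one asked here.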

Related

Sum of Julia Dataframe column where values of another column are in a list

How do I write a line of Julia code that sums the values of col2 for the rows where col1 is in list? I'm pretty new to Julia, and the following line fails with the error DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 10 and 3
total_sum = sum(df[ismember(df[:, :col1], list), :col2])
One way could be:
julia> df = DataFrame(reshape(1:12,4,3),:auto)
4×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 5 9
2 │ 2 6 10
3 │ 3 7 11
4 │ 4 8 12
julia> list = [2,3]
2-element Vector{Int64}:
2
3
julia> sum(df.x2[df.x1 .∈ Ref(list)])
13
This broadcasts in (Julia's counterpart of ismember), which can also be written as ∈. Ref(list) prevents broadcasting over list, so each element of df.x1 is tested against the whole list.
Depending on what you want to do, filter! is also worth knowing (using code from Dan Getz's answer). Note that filter! removes the non-matching rows from df in place:
julia> sum(filter!(:x1 => x1 -> x1 ∈ [2,3], df).x2)
13
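The idea behind both variants (test each value of one column against the whole list, then sum the matching values of the other column) can be sketched in plain Python. This is only an illustration, not Julia or DataFrames.jl code; sum_where_in is a hypothetical helper name, and the data mirrors columns x1 and x2 from the example above.

```python
def sum_where_in(keys, values, allowed):
    """Sum values[i] for every i where keys[i] is a member of `allowed`."""
    allowed_set = set(allowed)  # each key is tested against the whole collection
    return sum(v for k, v in zip(keys, values) if k in allowed_set)

x1 = [1, 2, 3, 4]
x2 = [5, 6, 7, 8]
print(sum_where_in(x1, x2, [2, 3]))  # 13
```

The key point is the same as with Ref(list): the list is treated as a single unit, never paired element by element with the column, which is what caused the DimensionMismatch.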
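Not exactly sure if this is what you're asking, but try intersect. Note that the values returned by intersect are used as row indexes below, so this only works because column a coincides with the row numbers: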
julia> using DataFrames
julia> df = DataFrame(a = 1:5, b = 2:6)
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 2 3
3 │ 3 4
4 │ 4 5
5 │ 5 6
julia> list = collect(3:10);
julia> sum(df.b[intersect(df.a, list)])
15

How to combine and make an array into a cell of DataFrames.jl?

Let's say we have
df = DataFrame(a=[1])
Row │ a
│ Int64
─────┼───────
1 │ 1
1. Tried to combine data and make a new column holding arrays:
combine(df, :a => x->[1,2])
Row │ a_function
│ Int64
─────┼────────────
1 │ 1
2 │ 2
2. Tried to combine data and make a new column holding tuples:
combine(df, :a => x->(1,2))
Row │ a_function
│ Tuple…
─────┼────────────
1 │ (1, 2)
Why doesn't attempt 1 (the array version) work as intended, i.e. hold the whole [1,2] array in one cell instead of creating 2 rows?
I converted the array into a tuple and it worked, but I wonder why the two behave so differently.

Clickhouse SELECT Array values that have all elements in a specific order

I need to find arrays that contain all of the values of another array, in the given order. I am looking for something similar to 'hasSubstr' or 'hasAll', or an idea of how to go about this.
Example Data
dataCol = [1,2,3,4]
hasSubstr
hasSubstr is close; however, it only matches when the values appear as an exact, contiguous run.
hasSubstr(dataCol, [1,2,4]) will return 0
But I need a 1 here because 1, 2 and 4 are in dataCol in the order of 1 then 2 then 4.
hasAll
hasAll is also close; however, it doesn't care about the order.
hasAll(dataCol, [4,2,1]) will return 1
But I need a 0 here because the order is incorrect.
Function or Query?
Something the equivalent of the 'imaginary' function: hasAllOrdered(dataCol, [1,3,4]) = 1
Or an idea of how to construct a query for this. Maybe a combination of hasAll and some query logic magic?
Edit: To clarify my intended result, I need to run a query to select multiple columns as could be used with a function.
SELECT
path AS dataCol,
track
FROM tracks
WHERE time_start > 1645232556
AND { magic here returning rows containing [276,277,279] in dataCol }
LIMIT 10
Query id: cac9b576-193e-475f-98e4-84354bf13af4
┌─dataCol───────────────────────────────────┬──track─┐
│ [211,210,207,205,204] │ 413354 │
│ [211,210,207,205,204] │ 413355 │
│ [73,74,142,209,277,276,208] │ 413356 │
│ [73,74,142,209,277,276,208] │ 413357 │
│ [280,279] │ 413358 │
│ [280,279] │ 413359 │
│ [272,208,276,277,278,346,347,273,206,207] │ 413360 │
│ [208,276,277,278,346,272,273,206,207,347] │ 413361 │
│ [276,277,278,279,348,208,209,141] │ 413362 │
│ [141,276,208,209,277,278,279,348] │ 413363 │
└───────────────────────────────────────────┴────────┘
10 rows in set. Elapsed: 0.007 sec. Processed 13.59 thousand rows, 273.88 KB (1.86 million rows/s., 37.49 MB/s.)
Ref: https://clickhouse.com/docs/en/sql-reference/functions/array-functions/
You have 2 arrays, [a, b] and [a, d, b].
Let's express the second array through the indexes of the first (indexOf + arrayMap): [a,d,b] --> [1,0,2], then remove the zeros (d was not found, so filter on <> 0) --> [1,2].
Now we accept the array only if the indexes grow; otherwise the elements are in the wrong order.
arrayDifference: [1,2] -> [0,1]. If this array has negative elements, the indexes do not grow
-- not arrayExists j < 0
create table tracks( dataCol Array(UInt64), track UInt64 ) Engine = Memory;
insert into tracks values
( [211,210,207,205,204] , 413354)
( [211,210,207,205,204] , 413355)
( [280,279] , 413358)
( [280,279] , 413359)
( [272,208,276,277,278,346,347,273,206,207], 413360)
( [208,276,277,278,346,272,273,206,207,347], 413361)
( [276,277,278,279,348,208,209,141] , 413362)
( [141,276,208,209,277,278,279,348] , 413363);
select *
from tracks
where hasAll(dataCol, [276,277,279] as x) and not arrayExists(j -> j < 0, arrayDifference(arrayFilter(i -> i <> 0, arrayMap(i -> indexOf(dataCol, i), x))))
┌─dataCol───────────────────────────┬──track─┐
│ [276,277,278,279,348,208,209,141] │ 413362 │
│ [141,276,208,209,277,278,279,348] │ 413363 │
└───────────────────────────────────┴────────┘
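The index-based check described in this answer can be mirrored in plain Python to see why it works. This is a sketch under assumptions, not ClickHouse code: has_all_ordered is a hypothetical name (echoing the imaginary function from the question), and it assumes the values in the data array are distinct, because only the first position of each value is considered, just as with indexOf.

```python
def has_all_ordered(data, wanted):
    """True if every element of `wanted` occurs in `data` in the same relative order.

    Assumes the values in `data` are distinct (only first positions are used).
    """
    # 1-based position of each wanted element, 0 if missing (like indexOf).
    positions = [data.index(v) + 1 if v in data else 0 for v in wanted]
    if 0 in positions:  # plays the role of the hasAll(...) condition
        return False
    # Differences between consecutive positions (like arrayDifference).
    diffs = [b - a for a, b in zip(positions, positions[1:])]
    # A negative difference means the positions do not grow, so the order is wrong.
    return not any(d < 0 for d in diffs)

tracks = [
    ([211, 210, 207, 205, 204], 413354),
    ([280, 279], 413358),
    ([272, 208, 276, 277, 278, 346, 347, 273, 206, 207], 413360),
    ([276, 277, 278, 279, 348, 208, 209, 141], 413362),
    ([141, 276, 208, 209, 277, 278, 279, 348], 413363),
]
matching = [track for path, track in tracks if has_all_ordered(path, [276, 277, 279])]
print(matching)  # [413362, 413363]
```

With the example from the question, has_all_ordered([1,2,3,4], [1,2,4]) is True and has_all_ordered([1,2,3,4], [4,2,1]) is False.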

groupby() using two arrays in julia?

I have two arrays of the same length:
a1 = [1,1,3,4,6,6]
a2 = [1,2,3,4,5,6]
I want to group both of them by the values of a1 and get the mean of a2 for each group.
The expected output, computed from a2, is shown below:
result:
1.5
3.0
4.0
5.5
Please suggest an approach to achieve this task.
Thanks!!
Here is a solution using DataFrames.jl:
julia> using DataFrames, Statistics
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> combine(groupby(df, :a1), :a2 => mean)
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
EDIT:
Here are the timings (as usual in Julia you need to remember that the first time you run some function it has to be compiled which takes time):
julia> using DataFrames, Statistics
(@v1.6) pkg> st DataFrames # I am using the main branch, as it should be released this week
Status `D:\.julia\environments\v1.6\Project.toml`
[a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`
julia> df = DataFrame(a1=rand(1:1000, 10^8), a2=rand(10^8)); # 10^8 rows in 1000 random groups
julia> @time combine(groupby(df, :a1), :a2 => mean); # first run includes compilation time
3.781717 seconds (6.76 M allocations: 1.151 GiB, 6.73% gc time, 84.20% compilation time)
julia> @time combine(groupby(df, :a1), :a2 => mean); # second run is just execution time
0.442082 seconds (294 allocations: 762.990 MiB)
Note that e.g. data.table (if this is your reference) on similar data is noticeably slower:
> library(data.table) # using 4 threads
> df = data.table(a1 = sample(1:1000, 10^8, replace=T), a2 = runif(10^8));
> system.time(df[, .(mean(a2)), by = a1])
user system elapsed
4.72 1.20 2.00
In case you are interested in using Chain.jl in addition to DataFrames.jl, Bogumił Kamiński's answer might then look like this:
julia> using DataFrames, Statistics, Chain
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> @chain df begin
groupby(:a1)
combine(:a2 => mean)
end
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
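For readers who want to see what the group-and-average step does without any DataFrame library, here is a minimal plain-Python sketch. It is not Julia code, and group_means is a hypothetical helper name; the data is the a1 and a2 from the question.

```python
from collections import defaultdict

def group_means(keys, values):
    """Group `values` by the matching entry in `keys` and return the mean per group."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    # Groups appear in order of first occurrence of each key.
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}

a1 = [1, 1, 3, 4, 6, 6]
a2 = [1, 2, 3, 4, 5, 6]
print(group_means(a1, a2))  # {1: 1.5, 3: 3.0, 4: 4.0, 6: 5.5}
```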

Evaluate whether an array contains an element other than given elements

I am trying to determine if my given array _servicetype contains an element other than 12,1,2,3.
Below is what I have so far,
Scenario 1: if my array is {1,2,3,6015} I want FALSE
Scenario 2: if my array is {1,2,12} I want TRUE
Scenario 3: if my array is {1,2} I want true
I ended up creating the IIF statement as a user-defined function in Postgres and came up with the following:
IIF(_servicetype @> ARRAY['12']::INT[]
OR _servicetype @> ARRAY['1']::INT[]
OR _servicetype @> ARRAY['2']::INT[]
OR _servicetype @> ARRAY['3']::INT[], TRUE, FALSE)::BOOLEAN
My concern is it will not work for Scenario 1.
You can check that ARRAY[12, 1, 2, 3] is a superset of _servicetype using the @> (contains) operator, i.e. if _servicetype contains anything not in ARRAY[12, 1, 2, 3], the result is false:
WITH examples(_servicetype) AS (
VALUES
('{1,2,3,6015}'::int[]),
('{2,1}'::int[]),
('{1}'::int[])
)
SELECT _servicetype, '{12, 1, 2, 3}' @> _servicetype
FROM examples;
┌──────────────┬──────────┐
│ _servicetype │ ?column? │
├──────────────┼──────────┤
│ {1,2,3,6015} │ f │
│ {2,1} │ t │ -- set-wise "contains", order does not matter
│ {1} │ t │
└──────────────┴──────────┘
(3 rows)
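The superset test amounts to a set-wise subset check, which can be sketched in Python as follows. This is only an illustration of the logic, not Postgres code; ALLOWED and only_allowed are hypothetical names, and the inputs are the three scenarios from the question.

```python
ALLOWED = {12, 1, 2, 3}

def only_allowed(servicetype, allowed=ALLOWED):
    """True if `servicetype` has no element outside `allowed` (order and duplicates ignored)."""
    return set(servicetype) <= allowed

print(only_allowed([1, 2, 3, 6015]))  # False (Scenario 1)
print(only_allowed([1, 2, 12]))       # True  (Scenario 2)
print(only_allowed([1, 2]))           # True  (Scenario 3)
```

An empty array also passes this check, since it has no elements outside the allowed set.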
