Can someone explain how the arrayJoin() function works in ClickHouse?

I am new to ClickHouse, and the example in the documentation doesn't help me understand the concept. An explanation of how arrayJoin() can be used in simple scenarios would be appreciated.

Consider the following scenarios.
When you need to transform an array into a relation (a set of rows):
/* get error */
SELECT 1
WHERE 1 IN ([1, 2]);
/* ok */
SELECT 1
WHERE 1 IN (SELECT arrayJoin([1, 2]));
/* get error */
SELECT *
FROM (SELECT [1, 2] a)
WHERE a = 2;
/* ok */
SELECT *
FROM (SELECT arrayJoin([1, 2]) a)
WHERE a = 2;
When you need to unfold (flatten) an array column into rows:
SELECT
metric_id,
metric_name,
arrayJoin(metric_values) AS metric_value
FROM
( /* test data */
SELECT
1 AS metric_id,
'name_1' AS metric_name,
[1, 4, 55] AS metric_values
UNION ALL
SELECT
2 AS metric_id,
'name_2' AS metric_name,
[-7, 11] AS metric_values
)
/* result
┌─metric_id─┬─metric_name─┬─metric_value─┐
│         1 │ name_1      │            1 │
│         1 │ name_1      │            4 │
│         1 │ name_1      │           55 │
│         2 │ name_2      │           -7 │
│         2 │ name_2      │           11 │
└───────────┴─────────────┴──────────────┘
*/
/* produce Cartesian product */
SELECT
arrayJoin([1, 2]) AS n,
arrayJoin(['a', 'b']) AS ll,
arrayJoin(['A', 'B']) AS ul
/* result
┌─n─┬─ll─┬─ul─┐
│ 1 │ a  │ A  │
│ 1 │ a  │ B  │
│ 1 │ b  │ A  │
│ 1 │ b  │ B  │
│ 2 │ a  │ A  │
│ 2 │ a  │ B  │
│ 2 │ b  │ A  │
│ 2 │ b  │ B  │
└───┴────┴────┘
*/
/* flatten a multidimensional array */
SELECT arrayJoin(arrayJoin([[1, 2], [3, 4]])) AS d
/* result
┌─d─┐
│ 1 │
│ 2 │
│ 3 │
│ 4 │
└───┘
*/
When you need to combine arrays item by item instead of getting a Cartesian product, consider using ARRAY JOIN:
/* cartesian product */
SELECT
arrayJoin(arr1),
arrayJoin(arr2)
FROM
(
SELECT
[1, 2] AS arr1,
[11, 22] AS arr2
)
/*
┌─arrayJoin(arr1)─┬─arrayJoin(arr2)─┐
│               1 │              11 │
│               1 │              22 │
│               2 │              11 │
│               2 │              22 │
└─────────────────┴─────────────────┘
*/
/* pair the arrays' items one by one */
SELECT a1, a2, arr1, arr2
FROM
(
SELECT
[1, 2] AS arr1,
[11, 22] AS arr2
)
ARRAY JOIN arr1 as a1, arr2 as a2
/*
┌─a1─┬─a2─┬─arr1──┬─arr2────┐
│  1 │ 11 │ [1,2] │ [11,22] │
│  2 │ 22 │ [1,2] │ [11,22] │
└────┴────┴───────┴─────────┘
*/
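For readers who think in general-purpose code, the difference between the two behaviours above can be sketched in Python. This is only an analogy, not how ClickHouse implements either feature: several arrayJoin() calls in one SELECT behave like nested loops (a Cartesian product), while ARRAY JOIN over several arrays pairs items by position, much like zip. The function names below are hypothetical.

```python
from itertools import product


def array_join_functions(*arrays):
    """Mimic several arrayJoin() calls in one SELECT: every combination."""
    return list(product(*arrays))


def array_join_clause(*arrays):
    """Mimic ARRAY JOIN arr1 AS a1, arr2 AS a2: items paired by position."""
    # Assumption: with default settings ClickHouse rejects arrays of
    # different sizes in one ARRAY JOIN, so this sketch fails loudly too.
    if len({len(a) for a in arrays}) > 1:
        raise ValueError("arrays in ARRAY JOIN must have the same length")
    return list(zip(*arrays))


arr1, arr2 = [1, 2], [11, 22]
print(array_join_functions(arr1, arr2))  # [(1, 11), (1, 22), (2, 11), (2, 22)]
print(array_join_clause(arr1, arr2))     # [(1, 11), (2, 22)]
```

The two printed lists correspond to the two result tables above: four rows for the Cartesian product and two rows for ARRAY JOIN.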

Related

How to display "Null" (SQLite)

In SQLite, I displayed the table "user" as shown below, but "Null" is not displayed, so I cannot tell NULL apart from a blank (empty string):
sqlite> .header on
sqlite> .mode box
sqlite> select * from user;
┌────┬─────────────────┐
│ id │ name            │
├────┼─────────────────┤
│ 1  │ Steve Jobs      │
│ 2  │                 │ <- Null
│ 3  │                 │ <- Null
│ 4  │ Bill Gates      │
│ 5  │                 │ <- Blank(Empty String)
│ 6  │ Mark Zuckerberg │
└────┴─────────────────┘
Are there any ways to display "Null"?
The command below sets the string that is displayed in place of NULL values:
.nullvalue <String>
So, set "Null" as shown below:
.nullvalue Null
Then "Null" is displayed for NULL values, as shown below:
sqlite> .header on
sqlite> .mode box
sqlite> select * from user;
┌────┬─────────────────┐
│ id │ name            │
├────┼─────────────────┤
│ 1  │ Steve Jobs      │
│ 2  │ Null            │ <- Null
│ 3  │ Null            │ <- Null
│ 4  │ Bill Gates      │
│ 5  │                 │ <- Blank(Empty String)
│ 6  │ Mark Zuckerberg │
└────┴─────────────────┘
Next, set "This is Null." as shown below:
.nullvalue "This is Null."
Then "This is Null." is displayed for NULL values, as shown below:
sqlite> .header on
sqlite> .mode box
sqlite> select * from user;
┌────┬─────────────────┐
│ id │ name            │
├────┼─────────────────┤
│ 1  │ Steve Jobs      │
│ 2  │ This is Null.   │ <- Null
│ 3  │ This is Null.   │ <- Null
│ 4  │ Bill Gates      │
│ 5  │                 │ <- Blank(Empty String)
│ 6  │ Mark Zuckerberg │
└────┴─────────────────┘
And these commands below show the details of the command ".nullvalue":
.help .nullvalue
Or:
.help nullvalue
The output looks like this:
sqlite> .help .nullvalue
.nullvalue STRING Use STRING in place of NULL values
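Note that .nullvalue only changes how the sqlite3 shell displays results. When the same table is read from a program, a similar effect can be had in the query itself with SQLite's ifnull() function. Here is a minimal sketch using Python's standard sqlite3 module; the in-memory table and its rows are assumptions recreated from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO user (id, name) VALUES (?, ?)",
    [
        (1, "Steve Jobs"),
        (2, None),  # NULL
        (3, None),  # NULL
        (4, "Bill Gates"),
        (5, ""),    # empty string, not NULL
        (6, "Mark Zuckerberg"),
    ],
)

# ifnull() substitutes the placeholder only for NULL; the empty string is kept.
rows = conn.execute(
    "SELECT id, ifnull(name, 'Null') FROM user ORDER BY id"
).fetchall()
for row in rows:
    print(row)
conn.close()
```

Rows 2 and 3 come back as 'Null', while row 5 stays an empty string, which is the same distinction the shell setting makes visible.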

Julia multidimensional array

I have a multidimensional array stored in a data frame in Julia.
dfy = DataFrame(a = [[1,2,3],[4,5,6],[7,8,9]], b = ["M","F","F"])
3×2 DataFrame
│ Row │ a │ b │
│ │ Array… │ String │
├─────┼───────────┼────────┤
│ 1 │ [1, 2, 3] │ M │
│ 2 │ [4, 5, 6] │ F │
│ 3 │ [7, 8, 9] │ F │
I would like to take the first column "a" and store the first value of each element in X1 (1, 4, 7), the second value in X2 (2, 5, 8), and the third value in X3 (3, 6, 9).
How can we accomplish this in Julia programming language?
You could try this:
for i in 1:3
dfy[:, "X$i"] = getindex.(dfy.a,i)
end
After running it, the result is:
julia> dfy
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
The dot (.) after getindex broadcasts the call over the column, so you are getting the i-th element from each row of the a column of your DataFrame.
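As a rough cross-language illustration of what the broadcast call does (an analogy in plain Python, not Julia code), taking the i-th element of every row looks like this. The variable names are made up for the sketch, and Python indexes from 0 where Julia indexes from 1.

```python
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# One new column per position: "X1" holds the first item of every row, etc.
columns = {f"X{i + 1}": [row[i] for row in a] for i in range(3)}
print(columns)  # {'X1': [1, 4, 7], 'X2': [2, 5, 8], 'X3': [3, 6, 9]}
```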
I give several options to show you what you can do.
Before I give my options, let me comment on the other answer, which is in general the most natural way to get what you want if you want to update the existing data frame. DataFrames.jl does not support indexing by a column name only. A DataFrame is a two-dimensional object, so it requires passing both a row index and a column index, like this:
julia> for i in 1:3
dfy[:, "X$i"] = getindex.(dfy.a, i)
end
julia> dfy
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
(note that this is what the error message prompts you to do -- i.e. that setindex! requires one more argument to be passed)
Now some more advanced options. The first is:
julia> rename!(x -> "X"*x, DataFrame(Tuple.(dfy.a)))
3×3 DataFrame
│ Row │ X1 │ X2 │ X3 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
│ 2 │ 4 │ 5 │ 6 │
│ 3 │ 7 │ 8 │ 9 │
because I understand you want a new data frame,
or, to create a new data frame combining old and new columns, just use horizontal concatenation:
julia> [dfy rename!(x -> "X"*x, DataFrame(Tuple.(dfy.a)))]
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
Finally, if you want to update the existing data frame you can write:
julia> transform!(dfy, [:a => (x -> getindex.(x, i)) => "X$i" for i in 1:3]...)
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │

Can Query.jl in Julia create rows from arrays of varying size?

I am wondering how to use Query.jl to extract values from an array in a dataframe and turn them into separate rows.
Background: I have used TextAnalysis.jl to tokenize some text and would like to have a data frame with one row per token for subsequent processing. @map got me that far, but despite various attempts with @mapmany I have not been able to get to a solution. Maybe @mapmany is the wrong choice here.
example = DataFrame(line = 1:4, text = [["First", "line"], ["Then", "number", "two"], ["And", "numero", "tres"], ["The", "End"]])
How do I get a result like this:
line | word
-----------------
1 | First
1 | line
2 | Then
… | …
3 | tres
4 | The
4 | End
DataFrames.flatten should do what you want:
julia> using DataFrames
julia> example = DataFrame(line = 1:4, text = [["First", "line"], ["Then", "number", "two"], ["And", "numero", "tres"], ["The", "End"]])
4×2 DataFrame
│ Row │ line │ text │
│ │ Int64 │ Array{String,1} │
├─────┼───────┼───────────────────────────┤
│ 1 │ 1 │ ["First", "line"] │
│ 2 │ 2 │ ["Then", "number", "two"] │
│ 3 │ 3 │ ["And", "numero", "tres"] │
│ 4 │ 4 │ ["The", "End"] │
julia> flatten(example, :text)
10×2 DataFrame
│ Row │ line │ text │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ First │
│ 2 │ 1 │ line │
│ 3 │ 2 │ Then │
│ 4 │ 2 │ number │
│ 5 │ 2 │ two │
│ 6 │ 3 │ And │
│ 7 │ 3 │ numero │
│ 8 │ 3 │ tres │
│ 9 │ 4 │ The │
│ 10 │ 4 │ End │
You can also do this manually with vcat and fill. Each line number must be repeated once per token, because the arrays vary in length:
lines = [...]
flat_lines = vcat(lines...)
line_no = vcat(fill.(1:length(lines), length.(lines))...)
df = DataFrame(line=line_no, word=flat_lines)
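The idea behind the manual approach, repeating each line number once per token so that both columns end up the same length, can be sketched in plain Python. This is only an illustration of the logic with the example data from the question, not Query.jl or DataFrames.jl code.

```python
text = [
    ["First", "line"],
    ["Then", "number", "two"],
    ["And", "numero", "tres"],
    ["The", "End"],
]

# Repeat each line number once per token, so both columns stay aligned
# even though the token lists have different lengths.
line_no = [i for i, tokens in enumerate(text, start=1) for _ in tokens]
words = [word for tokens in text for word in tokens]

rows = list(zip(line_no, words))
print(rows[:3])  # [(1, 'First'), (1, 'line'), (2, 'Then')]
```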

How to implement `pivot` in ClickHouse just like in DolphinDB

I want to pivot some data, as in the following pandas example.
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
... 'two'],
... 'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
... 'baz': [1, 2, 3, 4, 5, 6],
... 'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
foo bar baz zoo
0 one A 1 x
1 one B 2 y
2 one C 3 z
3 two A 4 q
4 two B 5 w
5 two C 6 t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar A B C
foo
one 1 2 3
two 4 5 6
I know DolphinDB can pivot in SQL:
dateValue=2007.08.01
num=500
syms = (exec count(*) from taq
where
date = dateValue,
time between 09:30:00 : 15:59:59,
0<bid, bid<ofr, ofr<bid*1.2
group by symbol order by count desc).symbol[0:num]
priceMatrix = exec avg(bid + ofr)/2.0 as price from taq
where
date = dateValue, Symbol in syms,
0<bid, bid<ofr, ofr<bid*1.2,
time between 09:30:00 : 15:59:59
pivot by time.minute() as minute, Symbol
But how can I pivot in ClickHouse? Should I use a client API to fetch the data? There are too many rows for that to be practical. And if I can't use pandas, how can I implement a pivot operation easily?
Here is a preliminary implementation that can help you get started.
Remarks:
'holes' in rows are not supported (each column must contain a value in every row)
all columns are cast to a common type (String)
the field orderNum is introduced: it is the position of the source column in the result (for example, the 'bar' column is 2nd)
the result is represented as rows with a single Array-typed column; the order of the array items is defined by orderNum
Prepare test data:
CREATE TABLE test.pivot_test
(
orderNum Int,
s String,
values Array(String)
) ENGINE = Memory;
INSERT INTO test.pivot_test
VALUES
(1, 'foo', ['one', 'one', 'one', 'two', 'two', 'two']),
(3, 'baz', ['1', '2', '3', '4', '5', '6']),
(4, 'zoo', ['x', 'y', 'z', 'q', 'w', 't']),
(2, 'bar', ['A', 'B', 'C', 'A', 'B', 'C']);
/*
The content of table test.pivot_test:
┌─orderNum─┬─s───┬─values────────────────────────────────┐
│        1 │ foo │ ['one','one','one','two','two','two'] │
│        3 │ baz │ ['1','2','3','4','5','6']             │
│        4 │ zoo │ ['x','y','z','q','w','t']             │
│        2 │ bar │ ['A','B','C','A','B','C']             │
└──────────┴─────┴───────────────────────────────────────┘
*/
Pivot-emulation:
SELECT arrayMap(x -> x.1, arraySort(x -> x.2, groupArray(value_ordernum))) as row
FROM
(
SELECT
(value, orderNum) AS value_ordernum,
value_index
FROM test.pivot_test
ARRAY JOIN
values AS value,
arrayEnumerate(values) AS value_index
/*
The result of executing the nested query:
┌─value_ordernum─┬─value_index─┐
│ ('one',1)      │           1 │
│ ('one',1)      │           2 │
│ ('one',1)      │           3 │
│ ('two',1)      │           4 │
│ ('two',1)      │           5 │
│ ('two',1)      │           6 │
│ ('1',3)        │           1 │
│ ('2',3)        │           2 │
│ ('3',3)        │           3 │
│ ('4',3)        │           4 │
│ ('5',3)        │           5 │
│ ('6',3)        │           6 │
│ ('x',4)        │           1 │
│ ('y',4)        │           2 │
│ ('z',4)        │           3 │
│ ('q',4)        │           4 │
│ ('w',4)        │           5 │
│ ('t',4)        │           6 │
│ ('A',2)        │           1 │
│ ('B',2)        │           2 │
│ ('C',2)        │           3 │
│ ('A',2)        │           4 │
│ ('B',2)        │           5 │
│ ('C',2)        │           6 │
└────────────────┴─────────────┘
*/
)
GROUP BY value_index;
/*
The final result:
┌─row─────────────────┐
│ ['two','A','4','q'] │
│ ['one','C','3','z'] │
│ ['one','B','2','y'] │
│ ['two','B','5','w'] │
│ ['one','A','1','x'] │
│ ['two','C','6','t'] │
└─────────────────────┘
*/
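To make the steps of the pivot emulation easier to follow, here is the same logic sketched in plain Python. It is an illustration of what the query does, not a replacement for it: the list pivot_test mirrors the test table above, and every other name is hypothetical.

```python
from collections import defaultdict

# Same shape as test.pivot_test: (orderNum, column name, values).
pivot_test = [
    (1, "foo", ["one", "one", "one", "two", "two", "two"]),
    (3, "baz", ["1", "2", "3", "4", "5", "6"]),
    (4, "zoo", ["x", "y", "z", "q", "w", "t"]),
    (2, "bar", ["A", "B", "C", "A", "B", "C"]),
]

# ARRAY JOIN step: one (value, orderNum) pair per array item,
# collected under the item's position (value_index).
groups = defaultdict(list)
for order_num, _name, values in pivot_test:
    for value_index, value in enumerate(values, start=1):
        groups[value_index].append((value, order_num))

# GROUP BY value_index, then sort each group by orderNum (arraySort)
# and keep only the value (arrayMap).
result = [
    [value for value, _ in sorted(pairs, key=lambda pair: pair[1])]
    for _, pairs in sorted(groups.items())
]
for row in result:
    print(row)
```

Unlike the ClickHouse query, which has no ORDER BY and returns the rows in arbitrary order, this sketch sorts the groups by value_index, so the rows come out in the original order.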

Plotting multiple Gadfly plots in a loop and saving to pdf

I am a relatively new Julia coder, and I want to plot multiple Gadfly plots in a loop and save them to a pdf.
If I do it without a loop, just for one block, the code is like this:
plot(x=[0, B_length[2,1,1],B_length[2,1,1],0],
y=[0,0,1,1],
layer(x=[B_length[2,1,1]/2], y=[0.5], label=[string("Spec: ", string(K_names[Kblock1[1]]))], Geom.label(position=:centered)), #specialty label for block 1
layer(x=[B_length[2,1,1]/2], y=[4.5], label=[string("Surgeon: ", string(S_names[Sblock1[1]]))], Geom.label(position=:centered)), #surgeon label for block 1
layer(x=[B_length[2,1,1]/2], y=[2.5], label=[string("No. P = ", string(Pblock1), " L = ", string(Lblock1))], Geom.label(position=:centered)), #number of patients and emergency
layer(x=[0, B_length[2,1,1],B_length[2,1,1],0],y=[4,4,5,5], #Surgeon - Block1
Geom.polygon(fill=true),
Theme(default_color=color("#00CCCC"))),
layer(x=[0, block2pt_end,block2pt_end,0],y=[2,2,3,3], #Patient and Emergency - Block1
Geom.polygon(fill=true),
Theme(default_color=color("gray"))),
Geom.polygon(preserve_order=true, fill=true))
And produces this graph:
When I go back and replace the values with their variables and build in a loop, I have this code:
for i=1:I
for r=1:R
B2used = sum(x_blockval[i, :, 2, r])
if (B2used >= 1)
k_block2[i,r] = findfirst([x_blockval[i,k,2,r] for k=1:K]) #block=1
s_block2[i,r] = findfirst([x_sval[r,i,2,s,k_block2[i,r]] for s=1:S])
p_block2[i,r] = x_pval[r,i,2,k_block2[i,r]]
l_block2[i,r] = x_lval[r,i,2,k_block2[i,r]]
block2pt_end[i,r] = F_lμ[k_block2[i,r]] + F_lσ[k_block2[i,r]]
println("Block 2 used on Day ", i, ", Room ", r, ", Specialty = ", k_block2[i,r], ", Surgeon = ", s_block2[i,r], ", num emergent patients = ", p_block2[i,r], ", num emergency patients = ", l_block2[i,r], ", end time of surgery = ", block2pt_end[i,r])
plot(x=[0, B_length[r,i,2],B_length[r,i,2],0],
y=[0,0,1,1],
layer(x=[B_length[r,i,2]/2], y=[0.5], label=[string("Spec: ", string(K_names[k_block2[i,r]]))], Geom.label(position=:centered)), #specialty label for block 1
layer(x=[B_length[r,i,2]/2], y=[4.5], label=[string("Surgeon: ", string(S_names[k_block2[i,r]]))], Geom.label(position=:centered)), #surgeon label for block 1
layer(x=[B_length[r,i,2]/2], y=[2.5], label=[string("No. P = ", string(p_block2[i,r]), " L = ", string(l_block2[i,r]))], Geom.label(position=:centered)), #number of patients and emergency
layer(x=[0, B_length[r,i,2],B_length[r,i,2],0],y=[4,4,5,5], #Surgeon - Block1
Geom.polygon(fill=true),
Theme(default_color=color("#00CCCC"))),
layer(x=[0, block2pt_end,block2pt_end,0],y=[2,2,3,3], #Patient and Emergency - Block1
Geom.polygon(fill=true),
Theme(default_color=color("gray"))),
Geom.polygon(preserve_order=true, fill=true))
end
end
end
But it does not produce any graphs. Ideally I'd like to save all the graphs in a PDF. I know I can somehow use 'draw(PDF("filename.pdf", 6inch, 9inch), vstack(p1,p[2]))' or something like that...
You can try the following simple example:
using DataFrames
using Distributions
using Gadfly
df = DataFrame(reshape(rand(Normal(600, 5), 1000), 100, 10))
My random data looks like this:
100×10 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 602.263 │ 605.084 │ 601.423 │ 604.667 │ 600.886 │ 601.023 │ 605.838 │
│ 2 │ 594.87 │ 595.692 │ 598.958 │ 605.384 │ 597.867 │ 597.473 │ 601.041 │
│ 3 │ 607.446 │ 598.39 │ 603.289 │ 603.193 │ 596.1 │ 605.915 │ 606.762 │
│ 4 │ 594.34 │ 599.959 │ 599.37 │ 600.762 │ 593.688 │ 598.451 │ 610.389 │
│ 5 │ 599.136 │ 598.936 │ 603.524 │ 602.486 │ 603.15 │ 615.217 │ 595.568 │
│ 6 │ 602.579 │ 592.314 │ 607.15 │ 599.523 │ 595.064 │ 599.134 │ 595.103 │
│ 7 │ 607.965 │ 593.523 │ 597.015 │ 596.653 │ 587.134 │ 601.784 │ 602.604 │
│ 8 │ 594.854 │ 601.6 │ 595.535 │ 601.941 │ 598.745 │ 598.318 │ 598.639 │
⋮
│ 92 │ 602.218 │ 594.57 │ 598.399 │ 598.049 │ 605.2 │ 599.202 │ 601.325 │
│ 93 │ 603.337 │ 590.793 │ 603.014 │ 592.754 │ 600.761 │ 598.986 │ 603.304 │
│ 94 │ 606.847 │ 600.5 │ 599.646 │ 602.013 │ 603.085 │ 604.451 │ 607.767 │
│ 95 │ 601.857 │ 603.694 │ 601.155 │ 609.234 │ 598.313 │ 593.84 │ 604.078 │
│ 96 │ 601.497 │ 612.724 │ 602.648 │ 599.876 │ 603.636 │ 598.606 │ 596.051 │
│ 97 │ 587.651 │ 597.113 │ 611.405 │ 608.394 │ 601.602 │ 593.162 │ 599.186 │
│ 98 │ 600.314 │ 592.158 │ 598.192 │ 596.135 │ 594.07 │ 612.595 │ 606.035 │
│ 99 │ 603.8 │ 600.477 │ 601.18 │ 601.254 │ 603.464 │ 591.172 │ 605.914 │
│ 100 │ 602.12 │ 599.617 │ 600.363 │ 591.685 │ 607.037 │ 599.461 │ 608.207 │
│ Row │ x8 │ x9 │ x10 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 597.964 │ 604.5 │ 606.66 │
│ 2 │ 591.599 │ 599.966 │ 602.308 │
│ 3 │ 590.3 │ 593.76 │ 600.803 │
│ 4 │ 601.73 │ 600.084 │ 601.51 │
│ 5 │ 592.919 │ 600.804 │ 606.705 │
│ 6 │ 602.157 │ 596.47 │ 608.817 │
│ 7 │ 594.768 │ 599.574 │ 604.912 │
│ 8 │ 590.703 │ 599.676 │ 598.265 │
⋮
│ 92 │ 599.965 │ 599.058 │ 603.945 │
│ 93 │ 601.147 │ 604.591 │ 594.569 │
│ 94 │ 597.442 │ 603.474 │ 593.651 │
│ 95 │ 600.611 │ 596.317 │ 598.212 │
│ 96 │ 604.025 │ 599.2 │ 597.129 │
│ 97 │ 595.265 │ 604.271 │ 593.711 │
│ 98 │ 594.281 │ 602.019 │ 603.592 │
│ 99 │ 603.784 │ 592.89 │ 599.684 │
│ 100 │ 592.422 │ 599.386 │ 601.215 │
Now, to plot each column of the data in a loop and save each plot as a PDF on your desktop, run the following. A plot created inside a loop is not displayed automatically, so each one is passed to draw explicitly:
for i in names(df)
p1 = plot(x = 1:length(df[i]), y = df[i])
draw(PDF(joinpath(homedir(), "Desktop/" * string(i) * ".pdf"), 10cm, 10cm), p1)
end
