I have a multidimensional array stored in a DataFrame in Julia.
dfy = DataFrame(a = [[1,2,3],[4,5,6],[7,8,9]], b = ["M","F","F"])
3×2 DataFrame
│ Row │ a │ b │
│ │ Array… │ String │
├─────┼───────────┼────────┤
│ 1 │ [1, 2, 3] │ M │
│ 2 │ [4, 5, 6] │ F │
│ 3 │ [7, 8, 9] │ F │
I would like to take column "a" and store the first value of each element in X1 (1, 4, 7), the second value in X2 (2, 5, 8), and the third value in X3 (3, 6, 9).
How can this be done in Julia?
You could try this:
for i in 1:3
dfy[:, "X$i"] = getindex.(dfy.a,i)
end
Once run here is the result:
julia> dfy
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
The dot . after getindex is the broadcasting (vectorization) operator, so you are getting the i-th element from each row of the a column of your DataFrame.
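To make the broadcasting concrete, here is a minimal sketch in plain Julia (no DataFrames needed). The variable names `a`, `second`, and `second_alt` are only illustrative and do not come from the answer above.

```julia
# A vector of vectors, shaped like the `a` column in the question
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# getindex(v, i) is what v[i] lowers to; the dot applies it to every element of `a`
second = getindex.(a, 2)

# The same result written as a comprehension
second_alt = [v[2] for v in a]

println(second)  # [2, 5, 8]
```

Both forms return a new `Vector{Int}`, which is why the result can be assigned directly as a new column.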
I give several options to show you what you can do.
Before I give my options, let me comment on the alternative answer, which in general is the most natural way to get what you want if you want to update the existing data frame. DataFrames.jl does not support indexing by a column name only. A DataFrame is a two-dimensional object and thus requires passing both a row and a column index, like this:
julia> for i in 1:3
dfy[:, "X$i"] = getindex.(dfy.a, i)
end
julia> dfy
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
(note that this is what the error message prompts you to do -- i.e. that setindex! requires one more argument to be passed)
Now some more advanced options. The first is:
julia> rename!(x -> "X"*x, DataFrame(Tuple.(dfy.a)))
3×3 DataFrame
│ Row │ X1 │ X2 │ X3 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
│ 2 │ 4 │ 5 │ 6 │
│ 3 │ 7 │ 8 │ 9 │
because I understand you want a new data frame,
or, to create a new data frame combining old and new columns, just use horizontal concatenation:
julia> [dfy rename!(x -> "X"*x, DataFrame(Tuple.(dfy.a)))]
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
Finally, if you want to update the existing data frame you can write:
julia> transform!(dfy, [:a => (x -> getindex.(x, i)) => "X$i" for i in 1:3]...)
3×5 DataFrame
│ Row │ a │ b │ X1 │ X2 │ X3 │
│ │ Array… │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1 │ [1, 2, 3] │ M │ 1 │ 2 │ 3 │
│ 2 │ [4, 5, 6] │ F │ 4 │ 5 │ 6 │
│ 3 │ [7, 8, 9] │ F │ 7 │ 8 │ 9 │
I am wondering how to use Query.jl to extract values from an array in a dataframe and turn them into separate rows.
Background: I have used TextAnalysis.jl to tokenize some text and would like to have a dataframe with one row per token for the subsequent processing. @map got me that far, but despite various attempts with @mapmany I have not been able to get to a solution. Maybe @mapmany is the wrong choice here.
example = DataFrame(line = 1:4, text = [["First", "line"], ["Then", "number", "two"], ["And", "numero", "tres"], ["The", "End"]])
How do I get a result like this:
line | word
-----------------
1 | First
1 | line
2 | Then
… | …
3 | tres
4 | The
4 | End
DataFrames.flatten should do what you want:
julia> using DataFrames
julia> example = DataFrame(line = 1:4, text = [["First", "line"], ["Then", "number", "two"], ["And", "numero", "tres"], ["The", "End"]])
4×2 DataFrame
│ Row │ line │ text │
│ │ Int64 │ Array{String,1} │
├─────┼───────┼───────────────────────────┤
│ 1 │ 1 │ ["First", "line"] │
│ 2 │ 2 │ ["Then", "number", "two"] │
│ 3 │ 3 │ ["And", "numero", "tres"] │
│ 4 │ 4 │ ["The", "End"] │
julia> flatten(example, :text)
10×2 DataFrame
│ Row │ line │ text │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ First │
│ 2 │ 1 │ line │
│ 3 │ 2 │ Then │
│ 4 │ 2 │ number │
│ 5 │ 2 │ two │
│ 6 │ 3 │ And │
│ 7 │ 3 │ numero │
│ 8 │ 3 │ tres │
│ 9 │ 4 │ The │
│ 10 │ 4 │ End │
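You can also do this manually with a few array manipulations. Note that the lines have different numbers of tokens, so each line number has to be repeated once per token on that line:
lines = [...]  # the vector of token vectors, e.g. example.text
flat_lines = vcat(lines...)
# repeat each line number as many times as that line has tokens
line_no = vcat(fill.(1:length(lines), length.(lines))...)
df = DataFrame(line=line_no, word=flat_lines)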
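As a self-contained sketch of the manual approach, the following uses only Base Julia on the token lists from the question. The names `words` and `line_no` are illustrative, and the comprehension is just one of several ways to build the line numbers.

```julia
# Same token lists as in the question's example
lines = [["First", "line"], ["Then", "number", "two"],
         ["And", "numero", "tres"], ["The", "End"]]

# Concatenate all token vectors into one flat vector
words = reduce(vcat, lines)

# Emit each line number once per token on that line
line_no = [i for (i, toks) in enumerate(lines) for _ in toks]

println(line_no)  # [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
```

The two vectors have the same length (10 here), so they could be passed as columns to `DataFrame(line = line_no, word = words)`.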
I am a newbie to ClickHouse DB and the example provided in the documentation doesn't help me properly understand the concept. An explanation of how arrayJoin() can be used in simple scenarios would be appreciated.
Let's consider the following scenarios:
when you need to transform an Array into a relation (a set of rows)
/* get error */
SELECT 1
WHERE 1 IN ([1, 2]);
/* ok */
SELECT 1
WHERE 1 IN (SELECT arrayJoin([1, 2]));
/* get error */
SELECT *
FROM (SELECT [1, 2] a)
WHERE a = 2;
/* ok */
SELECT *
FROM (SELECT arrayJoin([1, 2]) a)
WHERE a = 2;
to unfold/flatten the rows
SELECT
metric_id,
metric_name,
arrayJoin(metric_values) AS metric_value
FROM
( /* test data */
SELECT
1 AS metric_id,
'name_1' AS metric_name,
[1, 4, 55] AS metric_values
UNION ALL
SELECT
2 AS metric_id,
'name_2' AS metric_name,
[-7, 11] AS metric_values
)
/* result
┌─metric_id─┬─metric_name─┬─metric_value─┐
│ 1 │ name_1 │ 1 │
│ 1 │ name_1 │ 4 │
│ 1 │ name_1 │ 55 │
│ 2 │ name_2 │ -7 │
│ 2 │ name_2 │ 11 │
└───────────┴─────────────┴──────────────┘
*/
/* produce Cartesian product */
SELECT
arrayJoin([1, 2]) AS n,
arrayJoin(['a', 'b']) AS ll,
arrayJoin(['A', 'B']) AS ul
/* result
┌─n─┬─ll─┬─ul─┐
│ 1 │ a │ A │
│ 1 │ a │ B │
│ 1 │ b │ A │
│ 1 │ b │ B │
│ 2 │ a │ A │
│ 2 │ a │ B │
│ 2 │ b │ A │
│ 2 │ b │ B │
└───┴────┴────┘
*/
/* flatten the multidimensional array */
SELECT arrayJoin(arrayJoin([[1, 2], [3, 4]])) AS d
/* result
┌─d─┐
│ 1 │
│ 2 │
│ 3 │
│ 4 │
└───┘
*/
When you need to combine arrays item by item instead of getting a Cartesian product, consider using ARRAY JOIN:
/* cartesian product */
SELECT
arrayJoin(arr1),
arrayJoin(arr2)
FROM
(
SELECT
[1, 2] AS arr1,
[11, 22] AS arr2
)
/*
┌─arrayJoin(arr1)─┬─arrayJoin(arr2)─┐
│ 1 │ 11 │
│ 1 │ 22 │
│ 2 │ 11 │
│ 2 │ 22 │
└─────────────────┴─────────────────┘
*/
/* connect the arrays' items one by one */
SELECT a1, a2, arr1, arr2
FROM
(
SELECT
[1, 2] AS arr1,
[11, 22] AS arr2
)
ARRAY JOIN arr1 as a1, arr2 as a2
/*
┌─a1─┬─a2─┬─arr1──┬─arr2────┐
│ 1 │ 11 │ [1,2] │ [11,22] │
│ 2 │ 22 │ [1,2] │ [11,22] │
└────┴────┴───────┴─────────┘
*/
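For readers who think in Julia, as in the other threads here, the difference between several arrayJoin calls and a single ARRAY JOIN clause is roughly the difference between a nested loop and `zip`. This is only an analogy sketched in Base Julia, not ClickHouse code, and the names `cartesian` and `pairwise` are illustrative.

```julia
arr1 = [1, 2]
arr2 = [11, 22]

# Like two arrayJoin calls: every element of arr1 paired with every element of arr2
cartesian = [(a, b) for a in arr1 for b in arr2]

# Like ARRAY JOIN arr1, arr2: elements paired position by position
pairwise = collect(zip(arr1, arr2))

println(cartesian)  # [(1, 11), (1, 22), (2, 11), (2, 22)]
println(pairwise)   # [(1, 11), (2, 22)]
```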
I am a relatively new Julia coder, and I want to plot multiple Gadfly plots in a loop and save them to a pdf.
If I do it without a loop, just for one block, the code is like this:
plot(x=[0, B_length[2,1,1],B_length[2,1,1],0],
y=[0,0,1,1],
layer(x=[B_length[2,1,1]/2], y=[0.5], label=[string("Spec: ", string(K_names[Kblock1[1]]))], Geom.label(position=:centered)), #specialty label for block 1
layer(x=[B_length[2,1,1]/2], y=[4.5], label=[string("Surgeon: ", string(S_names[Sblock1[1]]))], Geom.label(position=:centered)), #surgeon label for block 1
layer(x=[B_length[2,1,1]/2], y=[2.5], label=[string("No. P = ", string(Pblock1), " L = ", string(Lblock1))], Geom.label(position=:centered)), #number of patients and emergency
layer(x=[0, B_length[2,1,1],B_length[2,1,1],0],y=[4,4,5,5], #Surgeon - Block1
Geom.polygon(fill=true),
Theme(default_color=color("#00CCCC"))),
layer(x=[0, block2pt_end,block2pt_end,0],y=[2,2,3,3], #Patient and Emergency - Block1
Geom.polygon(fill=true),
Theme(default_color=color("gray"))),
Geom.polygon(preserve_order=true, fill=true))
And produces this graph:
When I go back and replace the values with their variables and build in a loop, I have this code:
for i=1:I
for r=1:R
B2used = sum(x_blockval[i, :, 2, r])
if (B2used >= 1)
k_block2[i,r] = findfirst([x_blockval[i,k,2,r] for k=1:K]) #block=1
s_block2[i,r] = findfirst([x_sval[r,i,2,s,k_block2[i,r]] for s=1:S])
p_block2[i,r] = x_pval[r,i,2,k_block2[i,r]]
l_block2[i,r] = x_lval[r,i,2,k_block2[i,r]]
block2pt_end[i,r] = F_lμ[k_block2[i,r]] + F_lσ[k_block2[i,r]]
println("Block 2 used on Day ", i, ", Room ", r, ", Specialty = ", k_block2[i,r], ", Surgeon = ", s_block2[i,r], ", num emergent patients = ", p_block2[i,r], ", num emergency patients = ", l_block2[i,r], ", end time of surgery = ", block2pt_end[i,r])
plot(x=[0, B_length[r,i,2],B_length[r,i,2],0],
y=[0,0,1,1],
layer(x=[B_length[r,i,2]/2], y=[0.5], label=[string("Spec: ", string(K_names[k_block2[i,r]]))], Geom.label(position=:centered)), #specialty label for block 1
layer(x=[B_length[r,i,2]/2], y=[4.5], label=[string("Surgeon: ", string(S_names[k_block2[i,r]]))], Geom.label(position=:centered)), #surgeon label for block 1
layer(x=[B_length[r,i,2]/2], y=[2.5], label=[string("No. P = ", string(p_block2[i,r]), " L = ", string(l_block2[i,r]))], Geom.label(position=:centered)), #number of patients and emergency
layer(x=[0, B_length[r,i,2],B_length[r,i,2],0],y=[4,4,5,5], #Surgeon - Block1
Geom.polygon(fill=true),
Theme(default_color=color("#00CCCC"))),
layer(x=[0, block2pt_end,block2pt_end,0],y=[2,2,3,3], #Patient and Emergency - Block1
Geom.polygon(fill=true),
Theme(default_color=color("gray"))),
Geom.polygon(preserve_order=true, fill=true))
end
end
end
But it does not produce any graphs. Ideally I'd like to save all the graphs in a pdf. I know I can somehow use 'draw(PDF("filename.pdf", 6inch, 9inch), vstack(p1,p[2]))' or something like that...
You can try the following simple example:
using DataFrames
using Distributions
using Gadfly
df = DataFrame(reshape(rand(Normal(600, 5), 1000), 100, 10), :auto)  # :auto generates the column names x1..x10 (required in DataFrames 1.x)
So that my random data looks like:
100×10 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 602.263 │ 605.084 │ 601.423 │ 604.667 │ 600.886 │ 601.023 │ 605.838 │
│ 2 │ 594.87 │ 595.692 │ 598.958 │ 605.384 │ 597.867 │ 597.473 │ 601.041 │
│ 3 │ 607.446 │ 598.39 │ 603.289 │ 603.193 │ 596.1 │ 605.915 │ 606.762 │
│ 4 │ 594.34 │ 599.959 │ 599.37 │ 600.762 │ 593.688 │ 598.451 │ 610.389 │
│ 5 │ 599.136 │ 598.936 │ 603.524 │ 602.486 │ 603.15 │ 615.217 │ 595.568 │
│ 6 │ 602.579 │ 592.314 │ 607.15 │ 599.523 │ 595.064 │ 599.134 │ 595.103 │
│ 7 │ 607.965 │ 593.523 │ 597.015 │ 596.653 │ 587.134 │ 601.784 │ 602.604 │
│ 8 │ 594.854 │ 601.6 │ 595.535 │ 601.941 │ 598.745 │ 598.318 │ 598.639 │
⋮
│ 92 │ 602.218 │ 594.57 │ 598.399 │ 598.049 │ 605.2 │ 599.202 │ 601.325 │
│ 93 │ 603.337 │ 590.793 │ 603.014 │ 592.754 │ 600.761 │ 598.986 │ 603.304 │
│ 94 │ 606.847 │ 600.5 │ 599.646 │ 602.013 │ 603.085 │ 604.451 │ 607.767 │
│ 95 │ 601.857 │ 603.694 │ 601.155 │ 609.234 │ 598.313 │ 593.84 │ 604.078 │
│ 96 │ 601.497 │ 612.724 │ 602.648 │ 599.876 │ 603.636 │ 598.606 │ 596.051 │
│ 97 │ 587.651 │ 597.113 │ 611.405 │ 608.394 │ 601.602 │ 593.162 │ 599.186 │
│ 98 │ 600.314 │ 592.158 │ 598.192 │ 596.135 │ 594.07 │ 612.595 │ 606.035 │
│ 99 │ 603.8 │ 600.477 │ 601.18 │ 601.254 │ 603.464 │ 591.172 │ 605.914 │
│ 100 │ 602.12 │ 599.617 │ 600.363 │ 591.685 │ 607.037 │ 599.461 │ 608.207 │
│ Row │ x8 │ x9 │ x10 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 597.964 │ 604.5 │ 606.66 │
│ 2 │ 591.599 │ 599.966 │ 602.308 │
│ 3 │ 590.3 │ 593.76 │ 600.803 │
│ 4 │ 601.73 │ 600.084 │ 601.51 │
│ 5 │ 592.919 │ 600.804 │ 606.705 │
│ 6 │ 602.157 │ 596.47 │ 608.817 │
│ 7 │ 594.768 │ 599.574 │ 604.912 │
│ 8 │ 590.703 │ 599.676 │ 598.265 │
⋮
│ 92 │ 599.965 │ 599.058 │ 603.945 │
│ 93 │ 601.147 │ 604.591 │ 594.569 │
│ 94 │ 597.442 │ 603.474 │ 593.651 │
│ 95 │ 600.611 │ 596.317 │ 598.212 │
│ 96 │ 604.025 │ 599.2 │ 597.129 │
│ 97 │ 595.265 │ 604.271 │ 593.711 │
│ 98 │ 594.281 │ 602.019 │ 603.592 │
│ 99 │ 603.784 │ 592.89 │ 599.684 │
│ 100 │ 592.422 │ 599.386 │ 601.215 │
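Now, to plot each column of the data in a loop and save each plot as a PDF on your desktop, run the following. Note that a plot created inside a for loop is not displayed automatically, because only the value of the last top-level expression is shown; you have to call draw (or display) explicitly for each plot:
# Gadfly's PDF backend requires Cairo.jl and Fontconfig.jl to be loaded
import Cairo, Fontconfig
for i in names(df)
    # one plot per column: row index on x, column values on y
    p1 = plot(x = 1:nrow(df), y = df[!, i])
    # write the plot to ~/Desktop/<column name>.pdf
    draw(PDF(joinpath(homedir(), "Desktop", string(i) * ".pdf"), 10cm, 10cm), p1)
end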