My data structure looks similar to
tdata = Array{Int64,1}[]
# After 1st collection, push the first batch of data
push!(tdata, [1, 2, 3, 4, 5])
# After 2nd collection, push this batch of data
push!(tdata, [11, 12, 13, 14, 15])
Therefore, my data is
> tdata
2-element Array{Array{Int64,1},1}:
[1, 2, 3, 4, 5]
[11, 12, 13, 14, 15]
When I tried to convert this to a DataFrame,
> convert(DataFrame, tdata)
ERROR: MethodError: Cannot `convert` an object of type Array{Array{Int64,1},1} to an object of type DataFrame
while I was hoping for something similar to
2×5 DataFrame
Row │ c1 c2 c3 c4 c5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
Alternatively, I tried to save it to .CSV, but
> CSV.write("",tdata)
ERROR: ArgumentError: 'Array{Array{Int64,1},1}' iterates 'Array{Int64,1}' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Clearly, I have some misunderstanding of the data structure I have. Any suggestion is appreciated!
Either do this:
julia> using SplitApplyCombine
julia> DataFrame(invert(tdata), :auto)
2×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
or this:
julia> DataFrame(transpose(hcat(tdata...)), :auto)
2×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
or this:
julia> DataFrame(vcat(transpose(tdata)...), :auto)
2×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
or this:
julia> df = DataFrame(["c$i" => Int[] for i in 1:5])
0×5 DataFrame
julia> foreach(x -> push!(df, x), tdata)
julia> df
2×5 DataFrame
Row │ c1 c2 c3 c4 c5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
The challenge with your data is that you want vectors to be rows of the data frame, and normally vectors are treated as columns of a data frame.
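As a minimal sketch of that row-versus-column distinction, the same conversion can also be written with reduce(hcat, ...) and permutedims from Base, which avoids splatting when tdata holds many vectors. The names cols and rows are only illustrative.

```julia
using DataFrames

tdata = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]

# reduce(hcat, ...) puts each inner vector into its own column (5×2 matrix)
cols = reduce(hcat, tdata)

# permutedims swaps the axes, so each inner vector becomes a row (2×5 matrix)
rows = permutedims(cols)

# :auto generates the column names x1, x2, ...
df = DataFrame(rows, :auto)
```

Once the data is a DataFrame, it satisfies the Tables.jl interface, so passing df to CSV.write with a real file name should no longer raise the AbstractRow error.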
How do I write a line of Julia code that sums the values of col2 for the rows where the value of col1 is in list? I'm pretty new to Julia, and trying the following line throws the error DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 10 and 3
total_sum = sum(df[ismember(df[:, :col1], list), :col2])
One way could be:
julia> df = DataFrame(reshape(1:12,4,3),:auto)
4×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 5 9
2 │ 2 6 10
3 │ 3 7 11
4 │ 4 8 12
julia> list = [2,3]
2-element Vector{Int64}:
2
3
julia> sum(df.x2[df.x1 .∈ Ref(list)])
13
This broadcasts in (Julia's equivalent of ismember), which can also be written as ∈. Ref(list) wraps list so that broadcasting treats it as a single value instead of iterating over its elements.
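To see what Ref changes, here is a small sketch on plain vectors (col1, col2 and list are made-up stand-ins for the data frame columns). Without Ref, broadcasting tries to pair the elements of col1 and list one by one, which is what produces the DimensionMismatch from the question.

```julia
col1 = [1, 2, 3, 4]
col2 = [5, 6, 7, 8]
list = [2, 3]

# Ref(list) makes broadcasting treat list as one value: each element of col1
# is tested for membership in the whole list.
mask = col1 .∈ Ref(list)

# Equivalent spelling: in(list) builds a one-argument predicate, then broadcast it.
mask2 = in(list).(col1)

# Sum only the entries of col2 whose col1 value is in list.
total = sum(col2[mask])

# Without Ref the two vectors are broadcast against each other and their
# lengths (4 and 2) do not match.
threw = try
    col1 .∈ list
    false
catch err
    err isa DimensionMismatch
end
```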
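Depending on what you want to do, filter! is also worth knowing (using the data from Dan Getz's answer). Note that filter! removes the non-matching rows from df in place; use filter instead to get a new data frame and leave df unchanged: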
julia> sum(filter!(:x1 => x1 -> x1 ∈ [2,3], df).x2)
13
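Not exactly sure if this is what you're asking, but try intersect. Note that the values returned by intersect are used as row indices here, which works only because column a happens to coincide with the row numbers: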
julia> using DataFrames
julia> df = DataFrame(a = 1:5, b = 2:6)
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 2 3
3 │ 3 4
4 │ 4 5
5 │ 5 6
julia> list = collect(3:10);
julia> sum(df.b[intersect(df.a, list)])
15
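The intersect call returns values of a, and those values are then used as positions in b. A sketch of a variant that looks up the matching positions explicitly, and so should also work when the values of a are not row numbers, could look like this (the vectors a and b are made-up data standing in for the columns):

```julia
a = [10, 20, 30]
b = [1, 2, 3]
list = [20, 30, 40]

# findall returns the positions in a whose value is contained in list
idx = findall(in(list), a)

# index b by position, not by the values of a
total = sum(b[idx])
```

Here intersect(a, list) would give [20, 30], and b[[20, 30]] would throw a BoundsError, since b has only three elements.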
What is the efficient way to resize a matrix along the first dimension, i.e. add rows?
This is not possible for the standard Matrix type. You can only resize a Vector, e.g. with append! or push!.
You can vertically concatenate two matrices, but this allocates a new matrix:
julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
1 2
3 4
julia> y = [5 6; 7 8]
2×2 Matrix{Int64}:
5 6
7 8
julia> [x; y]
4×2 Matrix{Int64}:
1 2
3 4
5 6
7 8
The reason why adding a new row to a matrix in place is not supported is that it cannot be done efficiently given the memory layout of a matrix (essentially, the cost of such an operation would be similar to that of vertical concatenation).
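The memory-layout argument can be illustrated with vec, which exposes the order in which a matrix is stored. Julia matrices are column-major, so the elements of a new row do not go at the end of the storage; they have to be interleaved with the existing columns. A small sketch using the x and y from above:

```julia
x = [1 2; 3 4]
y = [5 6; 7 8]

# Storage order of x: first column, then second column
flat_x = vec(x)

# vcat(x, y) is the function form of [x; y]; it allocates a new 4×2 matrix
z = vcat(x, y)

# In the result, the rows of y end up in the middle and at the end of the storage
flat_z = vec(z)
```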
You would need another data structure to be able to do such resizing in place. For example, DataFrame from DataFrames.jl supports this (but note that quite likely the vertical concatenation I have described above is best for your use case):
julia> using DataFrames
julia> df = DataFrame(a=[1,2], b=[11,12])
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
julia> push!(df, [3, 13])
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
The reason why this can be done efficiently for a DataFrame is that internally it is a vector of vectors, so you can push! data to each vector representing a column.
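The idea can be sketched without DataFrames.jl at all. The following is only an illustration of the vector-of-columns layout, not the actual DataFrames.jl implementation; cols and newrow are hypothetical names.

```julia
# Each inner vector is one column: a = [1, 2], b = [11, 12]
cols = [[1, 2], [11, 12]]

# Adding a row means pushing one value onto the end of every column
newrow = [3, 13]
for (col, value) in zip(cols, newrow)
    push!(col, value)
end
```

Each push! only grows one Vector at its end, so no existing element has to move between columns.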
How can I create a dataframe out of separate arrays?
For example, I want this, but 18 rows by two columns.
using DataFrames
df = DataFrame(
year = [[3:1:20;]],
amt = [fill(200, 18)]
)
You don't need any arrays:
julia> using DataFrames
julia> df = DataFrame(year = 3:1:20, amt = 200)
18×2 DataFrame
Row │ year amt
│ Int64 Int64
─────┼──────────────
1 │ 3 200
2 │ 4 200
3 │ 5 200
4 │ 6 200
5 │ 7 200
6 │ 8 200
7 │ 9 200
8 │ 10 200
9 │ 11 200
10 │ 12 200
11 │ 13 200
12 │ 14 200
13 │ 15 200
14 │ 16 200
15 │ 17 200
16 │ 18 200
17 │ 19 200
18 │ 20 200
If this seems a bit magical (passing a range object and a single value rather than arrays), you can get the same result if you pass in "real" arrays like DataFrame(year = collect(3:1:20), amt = fill(200, 18)). Note however that this is unnecessary and less efficient.
Also note that your enclosing square brackets are probably not what you're after: fill(200, 18) already creates an array:
julia> fill(200, 18)
18-element Vector{Int64}:
200
200
(Vector{Int} is an alias for Array{Int, 1}), while enclosing this in another set of brackets will create an array of length one, which holds your amt array as its only element:
julia> [fill(200, 18)]
1-element Vector{Vector{Int64}}:
[200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200]
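The difference is easy to check in the REPL. A short sketch comparing the lengths and element types (the variable names are only illustrative):

```julia
amt = fill(200, 18)        # a plain vector with 18 elements
wrapped = [amt]            # a vector with one element, which is itself a vector

years = [3:1:20;]          # the trailing ; concatenates the range into a vector
wrapped_years = [[3:1:20;]]

lengths = (length(amt), length(wrapped), length(years), length(wrapped_years))
```

A column built from wrapped has a single row holding the whole array, which is why the original call did not give 18 rows.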
I have two arrays with same dimension:
a1 = [1,1,3,4,6,6]
a2 = [1,2,3,4,5,6]
And I want to group both of them with respect to array a1 and get the mean of the array a2 for each group.
My expected output, computed from array a2, is:
result:
1.5
3.0
4.0
5.5
Please suggest an approach to achieve this task.
Thanks!!
Here is a solution using DataFrames.jl:
julia> using DataFrames, Statistics
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> combine(groupby(df, :a1), :a2 => mean)
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
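For small inputs, the same group means can be computed without DataFrames.jl. The following is a simple sketch of the split-apply-combine idea, not how groupby and combine are implemented; it scans a1 once per group, so it is not meant for large data.

```julia
using Statistics

a1 = [1, 1, 3, 4, 6, 6]
a2 = [1, 2, 3, 4, 5, 6]

# The distinct group keys, in sorted order
groups = sort(unique(a1))

# For each key, select the matching entries of a2 with a Bool mask and average them
means = [mean(a2[a1 .== k]) for k in groups]
```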
EDIT:
Here are the timings (as usual in Julia you need to remember that the first time you run some function it has to be compiled which takes time):
julia> using DataFrames, Statistics
(@v1.6) pkg> st DataFrames # I am using main branch, as it should be released this week
Status `D:\.julia\environments\v1.6\Project.toml`
[a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`
julia> df = DataFrame(a1=rand(1:1000, 10^8), a2=rand(10^8)); # 10^8 rows in 1000 random groups
julia> @time combine(groupby(df, :a1), :a2 => mean); # first run includes compilation time
3.781717 seconds (6.76 M allocations: 1.151 GiB, 6.73% gc time, 84.20% compilation time)
julia> @time combine(groupby(df, :a1), :a2 => mean); # second run is just execution time
0.442082 seconds (294 allocations: 762.990 MiB)
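The first-call effect can be reproduced on any function. A minimal sketch, using a made-up function sumsq and @elapsed from Base, which returns the elapsed time in seconds instead of printing it:

```julia
# Hypothetical example function; any function would do
sumsq(v) = sum(abs2, v)

v = rand(10^6)

# The first call triggers compilation of sumsq for Vector{Float64}
t_first = @elapsed sumsq(v)

# The second call reuses the compiled code, so it measures execution only
t_second = @elapsed sumsq(v)

println("first: ", t_first, " s, second: ", t_second, " s")
```

The exact numbers depend on the machine and the Julia version, so no output is shown here.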
Note that e.g. data.table (if this is your reference) on similar data is noticeably slower:
> library(data.table) # using 4 threads
> df = data.table(a1 = sample(1:1000, 10^8, replace=T), a2 = runif(10^8));
> system.time(df[, .(mean(a2)), by = a1])
user system elapsed
4.72 1.20 2.00
In case you are interested in using Chain.jl in addition to DataFrames.jl, Bogumił Kamiński's answer might then look like this:
julia> using DataFrames, Statistics, Chain
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> @chain df begin
groupby(:a1)
combine(:a2 => mean)
end
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
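@chain passes the result of each step as the first argument of the next call, so the block is another way of writing the nested call from the earlier answer. As a sketch, here is the nested form next to a version using Base's |> operator with anonymous functions (the names nested and piped are only illustrative):

```julia
using DataFrames, Statistics

df = DataFrame(a1 = [1, 1, 3, 4, 6, 6], a2 = [1, 2, 3, 4, 5, 6])

# Nested form: the grouped data frame is the first argument of combine
nested = combine(groupby(df, :a1), :a2 => mean)

# Pipe form: each anonymous function receives the previous result
piped = df |> (d -> groupby(d, :a1)) |> (g -> combine(g, :a2 => mean))
```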
Any options to compare two arrays in ClickHouse?
There are two columns colA and colB, each contains an array.
Is there any way to compare the arrays in colA and colB for each row of a ClickHouse table and set colC to 1 if the arrays are equal and 0 if they are not?
For example:
colA | colB | colC
---------------------------------|----------------------------------|-----
{555,571,701,707,741,1470,4965} | {555,571,701,707,741,1470,4965} |1
{555,571,701,707,741,1470,4965} | {555,571,701,707,741,1470,4964} |0
I asked the same question at ClickHouse Google Group and got this answer from Denis Zhuravlev:
In the latest version of CH 18.1.0, 2018-07-23 (#2026):
select [111,222] A, [111,222] B, [111,333] C, A=B ab, A=C ac
results in
┌─A─────────┬─B─────────┬─C─────────┬─ab─┬─ac─┐
│ [111,222] │ [111,222] │ [111,333] │ 1 │ 0 │
└───────────┴───────────┴───────────┴────┴────┘
Before 18.1.0 you can use lambdas or something:
SELECT
NOT has(groupArray(A = B), 0) ab
,NOT has(groupArray(A = C), 0) ac
FROM
(
SELECT
[111,222] A
,[111,222] B
,[111,333] C
)
ARRAY JOIN
A
,B
,C
┌─ab─┬─ac─┐
│ 1 │ 0 │
└────┴────┘
I think array equality works now (tested on 20.3.5.21):
Cloud10 :) SELECT [2,1] = [1,2]
SELECT [2, 1] = [1, 2]
┌─equals([2, 1], [1, 2])─┐
│ 0 │
└────────────────────────┘
1 rows in set. Elapsed: 0.003 sec.
Cloud10 :) SELECT [2,1] = [2,1]
SELECT [2, 1] = [2, 1]
┌─equals([2, 1], [2, 1])─┐
│ 1 │
└────────────────────────┘
1 rows in set. Elapsed: 0.003 sec.