groupby() using two arrays in Julia?

I have two arrays with the same dimension:
a1 = [1,1,3,4,6,6]
a2 = [1,2,3,4,5,6]
I want to group both of them by the values in a1 and get the mean of a2 for each group.
The expected output, computed from a2, is:
result:
1.5
3.0
4.0
5.5
Please suggest an approach to achieve this task.
Thanks!!

Here is a solution using DataFrames.jl:
julia> using DataFrames, Statistics
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> combine(groupby(df, :a1), :a2 => mean)
4×2 DataFrame
 Row │ a1     a2_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     3      3.0
   3 │     4      4.0
   4 │     6      5.5
EDIT:
Here are the timings (as usual in Julia, remember that the first call to a function includes compilation, which takes time):
julia> using DataFrames, Statistics
(@v1.6) pkg> st DataFrames # I am using main branch, as it should be released this week
Status `D:\.julia\environments\v1.6\Project.toml`
[a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`
julia> df = DataFrame(a1=rand(1:1000, 10^8), a2=rand(10^8)); # 10^8 rows in 1000 random groups
julia> @time combine(groupby(df, :a1), :a2 => mean); # first run includes compilation time
  3.781717 seconds (6.76 M allocations: 1.151 GiB, 6.73% gc time, 84.20% compilation time)
julia> @time combine(groupby(df, :a1), :a2 => mean); # second run is just execution time
  0.442082 seconds (294 allocations: 762.990 MiB)
Note that e.g. data.table (if this is your reference) on similar data is noticeably slower:
> library(data.table) # using 4 threads
> df = data.table(a1 = sample(1:1000, 10^8, replace=T), a2 = runif(10^8));
> system.time(df[, .(mean(a2)), by = a1])
user system elapsed
4.72 1.20 2.00
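If pulling in DataFrames.jl is more than you need, the same grouping can be sketched in plain Julia. This is only an illustration and not the approach benchmarked above; the names groups, group_keys and group_means are made up for the example, and no performance claim is intended.

```julia
using Statistics

a1 = [1, 1, 3, 4, 6, 6]
a2 = [1, 2, 3, 4, 5, 6]

# Collect the a2 values that belong to each distinct key of a1.
groups = Dict{Int, Vector{Int}}()
for (k, v) in zip(a1, a2)
    push!(get!(groups, k, Int[]), v)
end

# Report the per-group means in sorted key order.
group_keys = sort!(collect(keys(groups)))
group_means = [mean(groups[k]) for k in group_keys]
```

Here group_keys holds the distinct values of a1 and group_means the matching means of a2, in the same order as the rows of the DataFrames.jl result.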

In case you are interested in using Chain.jl in addition to DataFrames.jl, Bogumił Kamiński's answer might then look like this:
julia> using DataFrames, Statistics, Chain
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> @chain df begin
           groupby(:a1)
           combine(:a2 => mean)
       end
4×2 DataFrame
 Row │ a1     a2_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     3      3.0
   3 │     4      4.0
   4 │     6      5.5

Related

Sum of Julia Dataframe column where values of another column are in a list

How do I write a line of Julia code that sums the values of col2 for the rows where col1 is in list? I'm pretty new to Julia, and the following line fails with: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 10 and 3
total_sum = sum(df[ismember(df[:, :col1], list), :col2])
One way could be:
julia> df = DataFrame(reshape(1:12,4,3),:auto)
4×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      5      9
   2 │     2      6     10
   3 │     3      7     11
   4 │     4      8     12
julia> list = [2,3]
2-element Vector{Int64}:
2
3
julia> sum(df.x2[df.x1 .∈ Ref(list)])
13
This broadcasts in (Julia's counterpart of ismember), which can also be written as ∈. Ref(list) wraps list so that broadcasting treats it as a single value instead of iterating over its elements.
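To see what Ref changes, here is a small sketch on plain vectors (x1, x2 and list simply mirror the columns above). Without Ref, broadcasting tries to pair the elements of x1 with the elements of list, which is where the DimensionMismatch in the question comes from.

```julia
x1 = [1, 2, 3, 4]
x2 = [5, 6, 7, 8]
list = [2, 3]

# With Ref, every element of x1 is tested against the whole list.
mask = x1 .∈ Ref(list)
total = sum(x2[mask])

# Without Ref, a length-4 vector is broadcast against a length-2 vector.
err = try
    x1 .∈ list
catch e
    e
end
```

In this sketch mask selects the second and third elements, and err ends up holding the DimensionMismatch exception.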
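Depending on what you want to do, filter! is also worth knowing (using the data from Dan Getz's answer). Note that filter! removes the non-matching rows from df in place; use filter if df should stay unchanged: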
julia> sum(filter!(:x1 => x1 -> x1 ∈ [2,3], df).x2)
13
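The difference between filter and filter! is easiest to see on a plain vector. A minimal sketch (the variable names are only for illustration); DataFrames.jl follows the same convention of the trailing ! marking the mutating variant.

```julia
v = [1, 2, 3, 4]

# filter returns a new vector and leaves v alone.
kept = filter(x -> x ∈ [2, 3], v)
v_before = copy(v)

# filter! drops the non-matching elements from v itself.
filter!(x -> x ∈ [2, 3], v)
```

After the last line v contains only the matching elements, so any later code that still needs the full data should use the non-mutating form.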
I'm not exactly sure if this is what you're asking, but try intersect:
julia> using DataFrames
julia> df = DataFrame(a = 1:5, b = 2:6)
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     2      3
   3 │     3      4
   4 │     4      5
   5 │     5      6
julia> list = collect(3:10);
julia> sum(df.b[intersect(df.a, list)])
15
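One caveat about the intersect approach: intersect returns values of column a, and those values are then used as row indices into b. That gives the right answer above only because a happens to equal the row numbers 1:5. A sketch with made-up data where the two differ, using a Boolean mask instead:

```julia
a = [10, 20, 30, 40, 50]
b = [2, 3, 4, 5, 6]
list = [30, 40, 99]

# Values common to a and list; these are not valid row indices here.
common = intersect(a, list)

# A Boolean mask selects rows by membership, independent of row position.
mask = in.(a, Ref(list))
total = sum(b[mask])
```

With this data b[common] would throw a BoundsError, while the mask picks the third and fourth rows.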

How to implement One-Vs-Rest for Multi-Class Classification in Julia?

I’m new to Julia and I am trying to implement One-Vs-Rest multi-class classification, and I was wondering if anyone could help me out. Here is a snippet of my code so far:
My data frame is basic, since I’m trying to figure out the implementation first: my c column is my class, with values in [0, 1, 2], and my y, x1, x2, x3 columns are random Int64 values.
using DataFrames
using CSV
using StatsBase
using StatsModels
using Statistics
using Plots, StatsPlots
using GLM
using Lathe
df = DataFrame(CSV.File("data.csv"))
fm = @formula(c~x1+x2+x3+y)
model0 = glm(fm0, df, Binomial(), ProbitLink()) # 0 vs [1,2]
model1 = glm(fm1, df, Binomial(), ProbitLink()) # 1 vs [0,2]
model2 = glm(fm2, df, Binomial(), ProbitLink()) # 2 vs [0,1]
I am trying to make logistic models but I don’t know how to do it.
If anyone can help me out, I would be thrilled.
I am trying to split the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
My only problem is that I don't know how to write the logistic model for a multi-class dataset.
Here is how you can do the same manually using GLM.jl (there is a lot of boilerplate code, but I wanted to keep the example simple):
df = DataFrame(x1=rand(100), x2=rand(100), x3=rand(100), target=rand([0, 1, 2], 100));
model0 = glm(@formula((target==0)~x1+x2+x3), df, Binomial(), ProbitLink())
model1 = glm(@formula((target==1)~x1+x2+x3), df, Binomial(), ProbitLink())
model2 = glm(@formula((target==2)~x1+x2+x3), df, Binomial(), ProbitLink())
choice = argmax.(eachrow([predict(model0) predict(model1) predict(model2)])) .- 1 # need to subtract 1 to use 0-based indexing
Let me explain the last operation step by step:
Get the predictions of the three models as columns of a matrix:
julia> [predict(model0) predict(model1) predict(model2)]
100×3 Matrix{Float64}:
0.517606 0.314234 0.206062
0.173916 0.431573 0.389071
0.211322 0.355592 0.413929
0.252108 0.337381 0.387629
0.515388 0.306834 0.211937
0.169052 0.386062 0.436603
0.125764 0.395105 0.490297
0.0955411 0.347634 0.589351
0.449734 0.341201 0.227459
⋮
0.412786 0.281343 0.303454
0.209337 0.354169 0.417261
0.37683 0.345307 0.273704
0.187584 0.411171 0.390831
0.401612 0.243119 0.350124
0.323155 0.338805 0.322453
0.488678 0.300927 0.23324
0.0979282 0.413296 0.522639
0.195902 0.313932 0.472582
Iterate rows of this matrix:
julia> eachrow([predict(model0) predict(model1) predict(model2)])
Base.Generator{Base.OneTo{Int64}, Base.var"#240#241"{Matrix{Float64}}}(Base.var"#240#241"{Matrix{Float64}}([0.5176063824396965 0.3142344514631397 0.2060615429588215; 0.17391563070921184 0.4315728844478078 0.3890711795309746; … ; 0.09792824142064335 0.41329629745776897 0.5226385962610233; 0.19590183503978997 0.31393218775269705 0.4725817014561341]), Base.OneTo(100))
For each row get index of maximum value:
julia> argmax.(eachrow([predict(model0) predict(model1) predict(model2)]))
100-element Vector{Int64}:
1
2
3
3
1
3
3
3
1
⋮
1
3
1
2
1
2
1
3
3
Subtract 1 from the result as Julia uses 1-based indexing, and you wanted the first model to have number 0:
julia> argmax.(eachrow([predict(model0) predict(model1) predict(model2)])) .- 1
100-element Vector{Int64}:
0
1
2
2
0
2
2
2
0
⋮
0
2
0
1
0
1
0
2
2
Alternatively you could write:
julia> map(predict(model0), predict(model1), predict(model2)) do x...
           return argmax(x) - 1
       end
100-element Vector{Int64}:
0
1
2
2
0
2
2
2
0
⋮
0
2
0
1
0
1
0
2
2
This is more efficient and shorter, but I was not sure whether it is clearer, as it uses slurping (x... collects the three arguments into one tuple).
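Both variants can be tried without fitting any model. The sketch below uses a small hand-written probability matrix P (hypothetical numbers, one column per binary model) in place of the predict outputs:

```julia
# Rows are observations, columns are the scores of models 0, 1 and 2.
P = [0.5 0.3 0.2;
     0.1 0.4 0.5;
     0.2 0.6 0.2]

# Variant 1: index of the largest score in each row, shifted to 0-based labels.
choice_rows = argmax.(eachrow(P)) .- 1

# Variant 2: map over the three columns at once; x... slurps one score
# per model into a tuple.
choice_map = map((x...) -> argmax(x) - 1, P[:, 1], P[:, 2], P[:, 3])
```

Both vectors contain the same class labels, one per row of P.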
An example of how to train one model for three classes using Flux.jl (still using the same df source data frame):
using Flux
model = Chain(Dense(3 => 3, σ), softmax)
X = permutedims(Matrix(df[:, 1:3]))
y = Flux.onehotbatch(df.target, 0:2)
optim = Flux.setup(Flux.Adam(0.01), model)
for epoch in 1:1_000
    Flux.train!(model, [(X, y)], optim) do m, x, y
        y_hat = m(x)
        Flux.crossentropy(y_hat, y)
    end
end
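For readers unfamiliar with the one-hot layout used for y, here is a plain-Julia sketch of the same idea: one row per class and one column per observation. It does not call Flux; labels, onehot and decoded are illustrative names, and the same column-wise argmax can be applied to a matrix of predicted probabilities to recover class labels.

```julia
labels = [0, 2, 1, 0]
classes = 0:2

# Entry (i, j) is true when observation j belongs to class i.
onehot = [c == l for c in classes, l in labels]

# Decoding: position of the largest entry in each column, shifted to 0-based labels.
decoded = argmax.(eachcol(onehot)) .- 1
```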
Personally, I would choose the pure Julia implementation, so Bogumił Kamiński's answer would be superior to mine.
I don't know of any package that provides a multi-target/label logistic regression model implemented fully in Julia (if there are any, I would like to know, and I'll prepend them to this answer). However, you can fit the model with ScikitLearn.jl, a wrapper around Python's scikit-learn that runs the code in a connected Python session. You can find further information in its repository. I created synthetic data as similar as I could to yours, to train the model on it and show how you can do it:
#- Packages
using DataFrames
using ScikitLearn
@sk_import linear_model: LogisticRegression
#- Synthetic data
df = DataFrame(
    x1=rand(100),
    x2=rand(100),
    x3=rand(100),
    target=rand([0, 1, 2], 100)
)
# 100×4 DataFrame
#  Row │ x1        x2        x3        target
#      │ Float64   Float64   Float64   Int64
# ─────┼──────────────────────────────────────────
#    1 │ 0.607024  0.6818    0.562058       0
#    2 │ 0.235538  0.974469  0.553292       1
#   ⋮  │    ⋮         ⋮         ⋮        ⋮
#   99 │ 0.382491  0.224192  0.122515       1
#  100 │ 0.617425  0.793276  0.228549       0
#- Split data
train, test = df[1:80, :], df[81:end, :]
Then I train the LogisticRegression model (which wraps the corresponding scikit-learn object in Python):
#- Train model
model = LogisticRegression(multi_class="ovr")
fit!(model, Matrix(train[:, 1:3]), train[:, 4])
And the last phase would be the prediction:
#- Predict
preds = predict(model, Matrix(test[:, 1:3]));
#- Count right predictions
sum(preds .== test[:, 4])
# returns `6` in my case
Note that you need to install PyCall.jl to use ScikitLearn.jl. Make sure to follow the instructions provided by PyCall.jl to set up the required environment first.
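A raw count of correct predictions is hard to judge without the test-set size, so an accuracy fraction is often more convenient. A minimal sketch with hypothetical prediction and truth vectors (not the output of the model above):

```julia
using Statistics

preds = [0, 1, 2, 1]   # hypothetical predicted classes
truth = [0, 2, 2, 1]   # hypothetical true classes

n_correct = sum(preds .== truth)   # number of matches
accuracy = mean(preds .== truth)   # fraction of matches
```

With random targets over three classes, as in the synthetic data here, an accuracy near 1/3 is what chance alone would give.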

How can I efficiently resize a matrix in julia?

What is the efficient way to resize a matrix along the first dimension, i.e. add rows?
This is not possible for the standard Matrix type. You can only resize a Vector, e.g. with append! or push!.
You can vertically concatenate two matrices, but this allocates a new matrix:
julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
1 2
3 4
julia> y = [5 6; 7 8]
2×2 Matrix{Int64}:
5 6
7 8
julia> [x; y]
4×2 Matrix{Int64}:
1 2
3 4
5 6
7 8
Adding a new row to a matrix in place is not supported because it cannot be done efficiently given the memory layout of a matrix (essentially, the cost of such an operation would be similar to vertical concatenation).
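The memory layout in question is column-major: a Matrix stores its columns one after another. A small sketch using vec to look at the storage order shows why a new column can simply go at the end, while a new row has to be interleaved with the existing elements:

```julia
A = [1 2; 3 4]

# Storage order of A: first column, then second column.
flat = vec(A)

# A new column lands after the existing data.
with_col = vec(hcat(A, [5, 6]))

# A new row puts one element into every column, so existing elements must move.
with_row = vec(vcat(A, [5 6]))
```

Comparing flat with with_row shows that every column after the first starts at a different offset once a row is added.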
You would need another data structure to be able to do such resizing in place. For example, DataFrame from DataFrames.jl supports this (but note that the vertical concatenation described above is quite likely the best choice for your use case):
julia> using DataFrames
julia> df = DataFrame(a=[1,2], b=[11,12])
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
julia> push!(df, [3, 13])
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
This is efficient for a DataFrame because internally it is a vector of vectors, so you can push! data to each vector representing a column.
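The vector-of-vectors idea can be sketched without DataFrames.jl. This is only a model of the layout, not how the package is actually implemented; cols and new_row are illustrative names.

```julia
# One inner vector per column, mirroring columns a and b above.
cols = [[1, 2], [11, 12]]
new_row = [3, 13]

# Adding a row means one push! per column; no existing element moves.
for (col, value) in zip(cols, new_row)
    push!(col, value)
end
```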

Converting array to DataFrame or Saving to CSV in Julia

My data structure looks similar to
tdata = Array{Int64,1}[]
# After 1st collection, push the first batch of data
push!(tdata, [1, 2, 3, 4, 5])
# After 2nd collection, push this batch of data
push!(tdata, [11, 12, 13, 14, 15])
Therefore, my data is
> tdata
2-element Array{Array{Int64,1},1}:
[1, 2, 3, 4, 5]
[11, 12, 13, 14, 15]
When I tried to convert this to a DataFrame,
> convert(DataFrame, tdata)
ERROR: MethodError: Cannot `convert` an object of type Array{Array{Int64,1},1} to an object of type DataFrame
while I was hoping for similar to
2×5 DataFrame
 Row │ c1     c2     c3     c4     c5
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      2      3      4      5
   2 │    11     12     13     14     15
Alternatively, I tried to save it to .CSV, but
> CSV.write("",tdata)
ERROR: ArgumentError: 'Array{Array{Int64,1},1}' iterates 'Array{Int64,1}' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Clearly, I have some misunderstanding of the data structure I have. Any suggestion is appreciated!
Either do this:
julia> using SplitApplyCombine
julia> DataFrame(invert(tdata), :auto)
2×5 DataFrame
 Row │ x1     x2     x3     x4     x5
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      2      3      4      5
   2 │    11     12     13     14     15
or this:
julia> DataFrame(transpose(hcat(tdata...)), :auto)
2×5 DataFrame
 Row │ x1     x2     x3     x4     x5
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      2      3      4      5
   2 │    11     12     13     14     15
or this:
julia> DataFrame(vcat(transpose(tdata)...), :auto)
2×5 DataFrame
 Row │ x1     x2     x3     x4     x5
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      2      3      4      5
   2 │    11     12     13     14     15
or this:
julia> df = DataFrame(["c$i" => Int[] for i in 1:5])
0×5 DataFrame
julia> foreach(x -> push!(df, x), tdata)
julia> df
2×5 DataFrame
 Row │ c1     c2     c3     c4     c5
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      2      3      4      5
   2 │    11     12     13     14     15
The challenge with your data is that you want vectors to be rows of the data frame, and normally vectors are treated as columns of a data frame.
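The rows-versus-columns point can also be seen in plain Julia. The sketch below first builds the 2×5 matrix without splatting (reduce(hcat, ...) avoids tdata... on long collections), and then, as one possible answer to the CSV part of the question, formats each inner vector as one comma-separated line. The output file name in the comment is hypothetical.

```julia
tdata = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]

# hcat makes each inner vector a column; permutedims turns them into rows.
M = permutedims(reduce(hcat, tdata))

# One text line per inner vector, values separated by commas.
csv_text = join((join(row, ',') for row in tdata), '\n')

# write("tdata.csv", csv_text)   # hypothetical file name
```

M can then be passed to DataFrame(M, :auto) as in the answers above.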

Convert Julia array to dataframe

I have an array X that I'd like to convert to a dataframe. Upon recommendation from the web, I tried converting to a dataframe and get the following error.
julia> y=convert(DataFrame,x)
ERROR: `convert` has no method matching convert(::Type{DataFrame}, ::Array{Float64,2})
in convert at base.jl:13
When I try DataFrame(x), the conversion works but i get a complaint that the conversion is deprecated.
julia> DataFrame(x)
WARNING: DataFrame(::Matrix, ::Vector)) is deprecated, use convert(DataFrame, Matrix) instead in DataFrame at /Users/Matthew/.julia/v0.3/DataFrames/src/deprecated.jl:54 (repeats 2 times)
Is there another method I should be aware of to keep my code consistent?
EDIT:
Julia 0.3.2,
DataFrames 0.5.10
OSX 10.9.5
julia> x=rand(4,4)
4x4 Array{Float64,2}:
0.467882 0.466358 0.28144 0.0151388
0.22354 0.358616 0.669564 0.828768
0.475064 0.187992 0.584741 0.0543435
0.0592643 0.345138 0.704496 0.844822
julia> convert(DataFrame,x)
ERROR: `convert` has no method matching convert(::Type{DataFrame}, ::Array{Float64,2}) in convert at base.jl:13
This works for me:
julia> using DataFrames
julia> x = rand(4, 4)
4x4 Array{Float64,2}:
0.790912 0.0367989 0.425089 0.670121
0.243605 0.62487 0.582498 0.302063
0.785159 0.0083891 0.881153 0.353925
0.618127 0.827093 0.577815 0.488565
julia> convert(DataFrame, x)
4x4 DataFrame
| Row | x1 | x2 | x3 | x4 |
|-----|----------|-----------|----------|----------|
| 1 | 0.790912 | 0.0367989 | 0.425089 | 0.670121 |
| 2 | 0.243605 | 0.62487 | 0.582498 | 0.302063 |
| 3 | 0.785159 | 0.0083891 | 0.881153 | 0.353925 |
| 4 | 0.618127 | 0.827093 | 0.577815 | 0.488565 |
Are you trying something different?
If that doesn't work, try posting a bit more code so we can help you better.
Since this is the first thing that comes up when you google: in more recent versions of DataFrames.jl, you can just call the DataFrame constructor directly:
julia> x = rand(4,4)
4×4 Matrix{Float64}:
0.920406 0.738911 0.994401 0.9954
0.18791 0.845132 0.277577 0.231483
0.361269 0.918367 0.793115 0.988914
0.725052 0.962762 0.413111 0.328261
julia> DataFrame(x, :auto)
4×4 DataFrame
 Row │ x1        x2        x3        x4
     │ Float64   Float64   Float64   Float64
─────┼────────────────────────────────────────
   1 │ 0.920406  0.738911  0.994401  0.9954
   2 │ 0.18791   0.845132  0.277577  0.231483
   3 │ 0.361269  0.918367  0.793115  0.988914
   4 │ 0.725052  0.962762  0.413111  0.328261
I've been confounded by the same issue a number of times, and eventually realized the issue is often related to the format of the array, and is easily resolved by simply transposing the array prior to conversion.
In short, I recommend:
julia> convert(DataFrame, x')
# convert a Matrix{Any} with a header row of col name strings to a DataFrame
# e.g. mat2df(["a" "b" "c"; 1 2 3; 4 5 6])
mat2df(mat) = convert(DataFrame,Dict(mat[1,:],
[mat[2:end,i] for i in 1:size(mat,2)]))
# convert a Matrix{Any} (mat) and a list of col name strings (headerstrings)
# to a DataFrame, e.g. matnms2df([1 2 3;4 5 6], ["a","b","c"])
matnms2df(mat, headerstrs) = convert(DataFrame,
Dict(zip(headerstrs,[mat[:,i] for i in 1:size(mat,2)])))
A little late, but with the update to the DataFrame() function, I created a custom function that would take a matrix (e.g. an XLSX imported dataset) and convert it into a DataFrame using the first row as column headers. Saves me a ton of time and, hopefully, it helps you too.
function MatrixToDataFrame(mat)
    DF_mat = DataFrame(
        mat[2:end, 1:end],
        string.(mat[1, 1:end])
    )
    return DF_mat
end
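The header-splitting step can be tried on its own in plain Julia. This sketch (colnames, cols and table are illustrative names) also narrows each column from Any to a concrete element type with identity., something a matrix with a header row otherwise loses. The result is a NamedTuple of vectors, which the DataFrame constructor accepts as a column table.

```julia
# A Matrix{Any} whose first row holds the column names.
mat = ["a" "b" "c"; 1 2 3; 4 5 6]

colnames = Symbol.(mat[1, :])

# Broadcasting identity re-infers the element type of each column.
cols = [identity.(mat[2:end, j]) for j in 1:size(mat, 2)]

table = NamedTuple{Tuple(colnames)}(Tuple(cols))
```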
So I found this online and honestly felt dumb.
using CSV, DataFrames
WhatIWant = DataFrame(WhatIHave)
This was adapted from an R guide, but it works.
DataFrame([1 2 3 4; 5 6 7 8; 9 10 11 12], :auto)
This works, as per the help entry (?DataFrame).
