What is the efficient way to resize a matrix along the first dimension, i.e. add rows?
This is not possible with the standard Matrix type. Only a Vector can be resized in place, e.g. with append! or push!.
You can vertically concatenate two matrices, but this allocates a new matrix:
julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
1 2
3 4
julia> y = [5 6; 7 8]
2×2 Matrix{Int64}:
5 6
7 8
julia> [x; y]
4×2 Matrix{Int64}:
1 2
3 4
5 6
7 8
The reason why adding a new row to a matrix in place is not supported is that it cannot be done efficiently: a Matrix is stored in memory as one contiguous, column-major block, so inserting a row would require moving the data of every column (essentially the cost of such an operation would be similar to vertical concatenation).
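As a rough illustration of the layout argument, here is a minimal sketch (not part of the original answer) that keeps the data in a flat, resizable Vector, treats each appended chunk as one logical row, and reshapes only once at the end. The names buf and ncols are made up for this example.

```julia
ncols = 2      # hypothetical number of values per logical row
buf = Int[]    # flat, resizable storage

# Appending one logical row is just an append! on the Vector.
for row in ([1, 2], [3, 4], [5, 6])
    append!(buf, row)
end

# Arrays are column-major, so each logical row becomes a column of the
# reshaped array; permutedims turns those columns back into rows.
m = permutedims(reshape(buf, ncols, :))
```

The final permutedims allocates a new matrix, so this only pays off when many rows are appended before the matrix is needed.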
You would need another data structure to be able to do such resizing in place. For example, DataFrame from DataFrames.jl supports this (but note that quite likely the vertical concatenation described above is best for your use case):
julia> using DataFrames
julia> df = DataFrame(a=[1,2], b=[11,12])
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
julia> push!(df, [3, 13])
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
The reason why this is efficient for a DataFrame is that internally it is a vector of vectors, so you can push! data to each vector representing a column.
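The vector-of-vectors idea can be sketched in plain Julia without DataFrames.jl. This is only an illustration of the principle, not how DataFrames.jl is actually implemented; cols and newrow are hypothetical names.

```julia
# Hypothetical column store: one resizable Vector per column.
cols = [[1, 2], [11, 12]]

# Adding a row means one push! per column; no existing data is moved.
newrow = (3, 13)
for (col, value) in zip(cols, newrow)
    push!(col, value)
end

# Build a Matrix only when one is actually needed (this step allocates).
m = reduce(hcat, cols)
```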
How do I write a line of Julia code that sums the values of col2 for the rows where the value of col1 is in list? I'm pretty new to Julia, and the line below fails with the error: Exception has occurred: DimensionMismatch DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 10 and 3
total_sum = sum(df[ismember(df[:, :col1], list), :col2])
One way could be:
julia> df = DataFrame(reshape(1:12,4,3),:auto)
4×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 5 9
2 │ 2 6 10
3 │ 3 7 11
4 │ 4 8 12
julia> list = [2,3]
2-element Vector{Int64}:
2
3
julia> sum(df.x2[df.x1 .∈ Ref(list)])
13
This broadcasts in (Julia's counterpart of ismember), which can also be written as ∈. Ref(list) prevents broadcasting over list, so each element of df.x1 is checked against the whole list.
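To see what Ref changes, here is a small sketch on plain vectors (made-up data, standing in for the data frame columns). It also reproduces the kind of DimensionMismatch reported in the question.

```julia
col1 = [1, 2, 3, 4]
col2 = [5, 6, 7, 8]
list = [2, 3]

# Ref(list) makes broadcasting treat `list` as a single value, so every
# element of col1 is tested against the whole list.
mask = in.(col1, Ref(list))
total = sum(col2[mask])

# Without Ref, broadcasting tries to pair col1 with list element by element,
# and the lengths 4 and 2 cannot be broadcast together.
mismatch = try
    in.(col1, list)
    false
catch err
    err isa DimensionMismatch
end
```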
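Depending on what you want to do, filter! is also worth knowing (using code from Dan Getz's answer). Note that filter! removes the non-matching rows from df in place; use filter if you want a new data frame instead: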
julia> sum(filter!(:x1 => x1 -> x1 ∈ [2,3], df).x2)
13
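Not exactly sure if this is what you're asking, but try intersect. Note that intersect returns the common values, which are then used as row indices; this gives the right rows here only because the values in column a coincide with the row numbers: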
julia> using DataFrames
julia> df = DataFrame(a = 1:5, b = 2:6)
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 2 3
3 │ 3 4
4 │ 4 5
5 │ 5 6
julia> list = collect(3:10);
julia> sum(df.b[intersect(df.a, list)])
15
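When the values in the key column do not coincide with row numbers, the matching positions have to be looked up first. A minimal sketch on plain vectors with made-up data, using findall with the predicate in(list):

```julia
a = [10, 20, 30, 40]
b = [1, 2, 3, 4]
list = [20, 40, 50]

# Positions in `a` whose value occurs in `list`. Using intersect(a, list)
# directly as indices would try b[20] and b[40], which are out of bounds.
rows = findall(in(list), a)
total = sum(b[rows])
```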
I'm new to Julia and I am trying to implement One-vs-Rest multi-class classification, and I was wondering if anyone could help me out. Here is a snippet of my code so far:
My data frame is basic since I’m trying to figure out the implementation first, my c column is my class consisting of [0, 1, 2], and my y, x1, x2, x3 are random Int64 values.
using DataFrames
using CSV
using StatsBase
using StatsModels
using Statistics
using Plots, StatsPlots
using GLM
using Lathe
df = DataFrame(CSV.File("data.csv"))
fm = @formula(c~x1+x2+x3+y)
model0 = glm(fm0, df, Binomial(), ProbitLink()) # 0 vs [1,2]
model1 = glm(fm1, df, Binomial(), ProbitLink()) # 1 vs [0,2]
model2 = glm(fm2, df, Binomial(), ProbitLink()) # 2 vs [0,1]
I am trying to make logistic models but I don’t know how to do it.
If anyone can help me out, I would be thrilled.
I am trying to split the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
My only problem is that I don't know how to write the logistic model for a multi-class dataset.
Here is how you can do the same manually using GLM.jl (there is a lot of boilerplate code, but I wanted to keep the example simple):
using DataFrames, GLM
df = DataFrame(x1=rand(100), x2=rand(100), x3=rand(100), target=rand([0, 1, 2], 100));
model0 = glm(@formula((target==0)~x1+x2+x3), df, Binomial(), ProbitLink())
model1 = glm(@formula((target==1)~x1+x2+x3), df, Binomial(), ProbitLink())
model2 = glm(@formula((target==2)~x1+x2+x3), df, Binomial(), ProbitLink())
choice = argmax.(eachrow([predict(model0) predict(model1) predict(model2)])) .- 1 # need to subtract 1 to use 0-based indexing
Let me explain the last operation step by step:
Get the predictions of the three models as columns of a matrix:
julia> [predict(model0) predict(model1) predict(model2)]
100×3 Matrix{Float64}:
0.517606 0.314234 0.206062
0.173916 0.431573 0.389071
0.211322 0.355592 0.413929
0.252108 0.337381 0.387629
0.515388 0.306834 0.211937
0.169052 0.386062 0.436603
0.125764 0.395105 0.490297
0.0955411 0.347634 0.589351
0.449734 0.341201 0.227459
⋮
0.412786 0.281343 0.303454
0.209337 0.354169 0.417261
0.37683 0.345307 0.273704
0.187584 0.411171 0.390831
0.401612 0.243119 0.350124
0.323155 0.338805 0.322453
0.488678 0.300927 0.23324
0.0979282 0.413296 0.522639
0.195902 0.313932 0.472582
Iterate rows of this matrix:
julia> eachrow([predict(model0) predict(model1) predict(model2)])
Base.Generator{Base.OneTo{Int64}, Base.var"#240#241"{Matrix{Float64}}}(Base.var"#240#241"{Matrix{Float64}}([0.5176063824396965 0.3142344514631397 0.2060615429588215; 0.17391563070921184 0.4315728844478078 0.3890711795309746; … ; 0.09792824142064335 0.41329629745776897 0.5226385962610233; 0.19590183503978997 0.31393218775269705 0.4725817014561341]), Base.OneTo(100))
For each row get index of maximum value:
julia> argmax.(eachrow([predict(model0) predict(model1) predict(model2)]))
100-element Vector{Int64}:
1
2
3
3
1
3
3
3
1
⋮
1
3
1
2
1
2
1
3
3
Subtract 1 from the result as Julia uses 1-based indexing, and you wanted the first model to have number 0:
julia> argmax.(eachrow([predict(model0) predict(model1) predict(model2)])) .- 1
100-element Vector{Int64}:
0
1
2
2
0
2
2
2
0
⋮
0
2
0
1
0
1
0
2
2
Alternatively you could write:
julia> map(predict(model0), predict(model1), predict(model2)) do x...
return argmax(x) - 1
end
100-element Vector{Int64}:
0
1
2
2
0
2
2
2
0
⋮
0
2
0
1
0
1
0
2
2
This is shorter and more efficient, but I was not sure whether it is clearer, as it uses slurping (x... collects the three predictions for one observation into a tuple).
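The row-wise argmax step can be tried in isolation on a small hand-written matrix. The sketch below uses made-up probabilities instead of fitted models; P, choice and choice2 are names introduced only for this example.

```julia
# Made-up scores: one row per observation, one column per class 0, 1, 2.
P = [0.5 0.3 0.2;
     0.1 0.4 0.5;
     0.2 0.6 0.2]

# Index of the largest entry in each row, shifted to 0-based class labels.
choice = argmax.(eachrow(P)) .- 1

# The same result via argmax along dimension 2, which returns Cartesian
# indices; the second component of each index is the column number.
choice2 = vec(getindex.(argmax(P, dims=2), 2)) .- 1
```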
An example of how to train one model for three classes using Flux.jl (still using the same df source data frame):
using Flux
model = Chain(Dense(3 => 3, σ), softmax)
X = permutedims(Matrix(df[:, 1:3]))
y = Flux.onehotbatch(df.target, 0:2)
optim = Flux.setup(Flux.Adam(0.01), model)
for epoch in 1:1_000
Flux.train!(model, [(X, y)], optim) do m, x, y
y_hat = m(x)
Flux.crossentropy(y_hat, y)
end
end
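To turn the output of such a model into class labels, take the most probable class for each column (one column per observation). Below is a minimal plain-Julia sketch on a made-up probability matrix, so it runs without Flux; probs and labels are hypothetical names. As far as I know, Flux.onecold(model(X), 0:2) performs the same decoding.

```julia
# Made-up output: rows are classes 0, 1, 2; columns are observations.
probs = [0.7 0.1;
         0.2 0.3;
         0.1 0.6]
labels = 0:2

# For each observation pick the label of the largest probability.
pred = [labels[argmax(col)] for col in eachcol(probs)]
```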
Personally, I would prefer a pure-Julia implementation, so Bogumił Kamiński's answer is superior to mine.
I don't know whether any package provides a multi-target/label Logistic Regression model implemented fully in Julia (if there is one, I would like to know and will prepend it to this answer). However, you can fit the model with ScikitLearn.jl, a wrapper around Python's scikit-learn that runs the code in a connected Python session. You can find further information in its repository. I created synthetic data, as similar as I could to what you have, to train the model on and to show how you can do it:
#- Packages
using DataFrames
using ScikitLearn
@sk_import linear_model: LogisticRegression
#- Synthetic data
df = DataFrame(
x1=rand(100),
x2=rand(100),
x3=rand(100),
target=rand([0, 1, 2], 100)
)
# 100×4 DataFrame
# Row │ x1 x2 x3 target
# │ Float64 Float64 Float64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 0.607024 0.6818 0.562058 0
# 2 │ 0.235538 0.974469 0.553292 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 99 │ 0.382491 0.224192 0.122515 1
# 100 │ 0.617425 0.793276 0.228549 0
#- Split data
train, test = df[1:80, :], df[81:end, :]
Then I train the LogisticRegression model (which wraps the corresponding scikit-learn object in Python):
#- Train model
model = LogisticRegression(multi_class="ovr")
fit!(model, Matrix(train[:, 1:3]), train[:, 4])
And the last phase would be the prediction:
#- Predict
preds = predict(model, Matrix(test[:, 1:3]));
#- Count right predictions
sum(preds .== test[:, 4])
# returns `6` in my case
Note that you need to install PyCall.jl to use ScikitLearn.jl. Make sure to follow the instructions provided by the PyCall.jl to set up the required environment first.
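The count of correct predictions above can also be expressed as an accuracy. A small sketch with made-up label vectors, so it runs without Python; preds and truth here are hypothetical stand-ins for the model output and test[:, 4].

```julia
using Statistics

preds = [0, 1, 2, 1]   # hypothetical predicted classes
truth = [0, 2, 2, 1]   # hypothetical true classes

ncorrect = sum(preds .== truth)    # number of matching labels
accuracy = mean(preds .== truth)   # fraction of matching labels
```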
I have two arrays with same dimension:
a1 = [1,1,3,4,6,6]
a2 = [1,2,3,4,5,6]
I want to group both of them by the values of array a1 and get the mean of array a2 for each group.
The expected output, computed from array a2, is shown below:
result:
1.5
3.0
4.0
5.5
Please suggest an approach to achieve this task.
Thanks!!
Here is a solution using DataFrames.jl:
julia> using DataFrames, Statistics
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> combine(groupby(df, :a1), :a2 => mean)
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
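For small inputs the same grouping can be sketched without DataFrames.jl. This version scans a2 once per group, so it is only meant as an illustration and will not scale like groupby; groups and means are names introduced here.

```julia
using Statistics

a1 = [1, 1, 3, 4, 6, 6]
a2 = [1, 2, 3, 4, 5, 6]

groups = unique(a1)                            # group keys in order of first appearance
means = [mean(a2[a1 .== g]) for g in groups]   # mean of a2 within each group
```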
EDIT:
Here are the timings (as usual in Julia you need to remember that the first time you run some function it has to be compiled which takes time):
julia> using DataFrames, Statistics
(@v1.6) pkg> st DataFrames # I am using main branch, as it should be released this week
Status `D:\.julia\environments\v1.6\Project.toml`
[a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`
julia> df = DataFrame(a1=rand(1:1000, 10^8), a2=rand(10^8)); # 10^8 rows in 1000 random groups
julia> @time combine(groupby(df, :a1), :a2 => mean); # first run includes compilation time
3.781717 seconds (6.76 M allocations: 1.151 GiB, 6.73% gc time, 84.20% compilation time)
julia> @time combine(groupby(df, :a1), :a2 => mean); # second run is just execution time
0.442082 seconds (294 allocations: 762.990 MiB)
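The first-call compilation effect is easy to reproduce with any function. A minimal sketch using @elapsed on a made-up function sumsq; the actual timings depend on the machine, so none are shown here.

```julia
sumsq(x) = sum(abs2, x)   # hypothetical function to time
x = rand(10^6)

t_first = @elapsed sumsq(x)    # first call: includes compiling sumsq for Vector{Float64}
t_second = @elapsed sumsq(x)   # second call: execution time only

println("first call: ", t_first, " s, second call: ", t_second, " s")
```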
Note that e.g. data.table (if this is your reference) on similar data is noticeably slower:
> library(data.table) # using 4 threads
> df = data.table(a1 = sample(1:1000, 10^8, replace=T), a2 = runif(10^8));
> system.time(df[, .(mean(a2)), by = a1])
user system elapsed
4.72 1.20 2.00
In case you are interested in using Chain.jl in addition to DataFrames.jl, Bogumił Kamiński's answer might then look like this:
julia> using DataFrames, Statistics, Chain
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> @chain df begin
groupby(:a1)
combine(:a2 => mean)
end
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
How can I concatenate arrays of different size with a "filler" value where the arrays don't line up?
a = [1,2,3]
b = [1,2]
And I would like:
[1 2 3
1 2 missing]
Or
[1 2 3
1 2 nothing]
One way is rstack ("ragged stack") from LazyStack.jl. It always places the arrays along one new dimension, so given vectors, they form the columns of a matrix. (The original question may want the transpose of this result.)
julia> using LazyStack
julia> rstack(a, b; fill=missing)
3×2 Matrix{Union{Missing, Int64}}:
1 1
2 2
3 missing
julia> rstack(a, b, reverse(a), reverse(b); fill=NaN)
3×4 Matrix{Real}:
1 1 3 2
2 2 2 1
3 NaN 1 NaN
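If adding a package is not an option, the padding can be sketched by hand in plain Julia. The helper below is hypothetical (padrows and its filler keyword are names made up for this example); it stacks the vectors as rows, which is the layout asked for in the question.

```julia
# Hypothetical helper: stack vectors as rows, padding short ones with `filler`.
function padrows(vs::AbstractVector...; filler=missing)
    ncols = maximum(length, vs)
    # Element type that can hold both the data and the filler value.
    T = Union{typeof(filler), promote_type(map(eltype, vs)...)}
    out = Matrix{T}(undef, length(vs), ncols)
    fill!(out, filler)
    for (i, v) in enumerate(vs)
        out[i, 1:length(v)] = v   # copy the vector into the start of row i
    end
    return out
end

a = [1, 2, 3]
b = [1, 2]
m = padrows(a, b)
```

Passing filler=nothing gives the second variant from the question.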
I can't find an answer to this simple question.
I have the following:
A(a,j)=[a*j*i*k for i in 1:2, k in 1:2];
B=[A(a,j) for a in 1:2, j in 1:2];
B is an array of arrays: 2×2 Array{Array{Int64,2},2}. This is useful to easily access the subarrays with indices (e.g., B[2,1]). However, I also need to convert B to a 4 by 4 matrix. I tried hcat(B...) but that yields a 2 by 8 matrix, and other options are worse (e.g., cat(Test2...;dims=(2,1))).
Is there an efficient way of writing B as a matrix while keeping the ability to easily access its subarrays, especially as B gets very large?
Do you want this?
julia> hvcat(size(B,1), B...)
4×4 Array{Int64,2}:
1 2 2 4
2 4 4 8
2 4 4 8
4 8 8 16
or without defining B:
julia> hvcat(2, (A(a,j) for a in 1:2, j in 1:2)...)
4×4 Array{Int64,2}:
1 2 2 4
2 4 4 8
2 4 4 8
4 8 8 16
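One caveat, shown as a sketch: hvcat consumes its arguments row by row, whereas splatting B iterates it column by column. In the example above the blocks are symmetric in a and j, so the difference is invisible. For general blocks, transposing the block layout first keeps block (a, j) at block position (a, j). The function blk below is a made-up stand-in for A.

```julia
# Hypothetical blocks whose value encodes their position:
# block (a, j) is filled with 10a + j.
blk(a, j) = fill(10a + j, 2, 2)
B = [blk(a, j) for a in 1:2, j in 1:2]

# Pass the blocks in row-major order, with size(B, 2) blocks per block row.
M = hvcat(size(B, 2), permutedims(B)...)

# For comparison: splatting B directly places the blocks transposed.
Mt = hvcat(size(B, 1), B...)
```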
What about
B = reduce(hcat, reduce(vcat, A(a,j) for a in 1:2) for j in 1:2)
EDIT: Actually this is very slow, I would recommend making a function, e.g.,
function buildB(A, n)
A0 = A(1,1)
nA = size(A0, 1)
B = Array{eltype(A0),2}(undef, n * nA, n * nA)
for a in 1:n, j in 1:n
B[(a-1)*nA .+ (1:nA), (j-1)*nA .+ (1:nA)] .= A(a,j)
end
return B
end
or maybe consider a package like BlockArrays.jl?
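If a plain Matrix is enough, block access can also be kept with views into it. A minimal sketch, assuming square blocks of size nA as in buildB; blockview is a hypothetical helper name.

```julia
# The 4×4 matrix from the example, made of 2×2 blocks.
M = [1 2 2 4; 2 4 4 8; 2 4 4 8; 4 8 8 16]
nA = 2

# View of block (a, j); no data is copied.
blockview(M, nA, a, j) = view(M, (a - 1) * nA .+ (1:nA), (j - 1) * nA .+ (1:nA))

blk = blockview(M, nA, 1, 2)
before = copy(blk)

# Writing through the view changes M itself.
blk[1, 1] = 0
```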
EDIT 2: This is an example with BlockArrays.jl:
using BlockArrays
function blockarrays(A, n)
A0 = A(1,1)
nA = size(A0, 1)
B = BlockArray{eltype(A0)}(undef_blocks, fill(nA,n), fill(nA,n))
for a in 1:n, j in 1:n
setblock!(B, A(a,j), a, j)
end
return B
end
which should do what you want:
julia> B = blockarrays(A, 2)
2×2-blocked 4×4 BlockArray{Int64,2}:
1 2 │ 2 4
2 4 │ 4 8
──────┼───────
2 4 │ 4 8
4 8 │ 8 16
julia> getblock(B, 1, 2)
2×2 Array{Int64,2}:
2 4
4 8
julia> B[4,2]
8