I have an array X that I'd like to convert to a dataframe. Upon recommendation from the web, I tried converting to a dataframe and get the following error.
julia> y=convert(DataFrame,x)
ERROR: `convert` has no method matching convert(::Type{DataFrame}, ::Array{Float64,2})
in convert at base.jl:13
When I try DataFrame(x), the conversion works, but I get a warning that this form is deprecated.
julia> DataFrame(x)
WARNING: DataFrame(::Matrix, ::Vector)) is deprecated, use convert(DataFrame, Matrix) instead in DataFrame at /Users/Matthew/.julia/v0.3/DataFrames/src/deprecated.jl:54 (repeats 2 times)
Is there another method I should be aware of to keep my code consistent?
EDIT:
Julia 0.3.2,
DataFrames 0.5.10
OSX 10.9.5
julia> x=rand(4,4)
4x4 Array{Float64,2}:
0.467882 0.466358 0.28144 0.0151388
0.22354 0.358616 0.669564 0.828768
0.475064 0.187992 0.584741 0.0543435
0.0592643 0.345138 0.704496 0.844822
julia> convert(DataFrame,x)
ERROR: `convert` has no method matching convert(::Type{DataFrame}, ::Array{Float64,2}) in convert at base.jl:13
This works for me:
julia> using DataFrames
julia> x = rand(4, 4)
4x4 Array{Float64,2}:
0.790912 0.0367989 0.425089 0.670121
0.243605 0.62487 0.582498 0.302063
0.785159 0.0083891 0.881153 0.353925
0.618127 0.827093 0.577815 0.488565
julia> convert(DataFrame, x)
4x4 DataFrame
| Row | x1 | x2 | x3 | x4 |
|-----|----------|-----------|----------|----------|
| 1 | 0.790912 | 0.0367989 | 0.425089 | 0.670121 |
| 2 | 0.243605 | 0.62487 | 0.582498 | 0.302063 |
| 3 | 0.785159 | 0.0083891 | 0.881153 | 0.353925 |
| 4 | 0.618127 | 0.827093 | 0.577815 | 0.488565 |
Are you trying something different?
If that doesn't work try posting a bit more code we can help you better.
Since this is the first thing that comes up when you google: in more recent versions of DataFrames.jl, you can call the DataFrame constructor directly, passing :auto so that column names are generated for you:
julia> x = rand(4,4)
4×4 Matrix{Float64}:
0.920406 0.738911 0.994401 0.9954
0.18791 0.845132 0.277577 0.231483
0.361269 0.918367 0.793115 0.988914
0.725052 0.962762 0.413111 0.328261
julia> DataFrame(x, :auto)
4×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────
1 │ 0.920406 0.738911 0.994401 0.9954
2 │ 0.18791 0.845132 0.277577 0.231483
3 │ 0.361269 0.918367 0.793115 0.988914
4 │ 0.725052 0.962762 0.413111 0.328261
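If the auto-generated names are not what you want, a minimal sketch of the alternatives might look like the following. It assumes a recent DataFrames.jl (1.x); the column names `height` and `width` are made up for the example.

```julia
using DataFrames

x = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # small 3×2 matrix, for illustration only

# :auto generates the names x1, x2, ...
df_auto = DataFrame(x, :auto)

# Alternatively, pass one name per column (hypothetical names)
df_named = DataFrame(x, [:height, :width])

# Converting back to a plain matrix
m = Matrix(df_named)
```

Passing the names explicitly avoids a separate rename step afterwards.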
I've been confounded by the same issue a number of times, and eventually realized the issue is often related to the format of the array, and is easily resolved by simply transposing the array prior to conversion.
In short, I recommend:
julia> convert(DataFrame, x')
# convert a Matrix{Any} with a header row of col name strings to a DataFrame
# e.g. mat2df(["a" "b" "c"; 1 2 3; 4 5 6])
mat2df(mat) = convert(DataFrame,Dict(mat[1,:],
[mat[2:end,i] for i in 1:size(mat,2)]))
# convert a Matrix{Any} (mat) and a list of col name strings (headerstrings)
# to a DataFrame, e.g. matnms2df([1 2 3;4 5 6], ["a","b","c"])
matnms2df(mat, headerstrs) = convert(DataFrame,
Dict(zip(headerstrs,[mat[:,i] for i in 1:size(mat,2)])))
A little late, but with the update to the DataFrame() function, I created a custom function that would take a matrix (e.g. an XLSX imported dataset) and convert it into a DataFrame using the first row as column headers. Saves me a ton of time and, hopefully, it helps you too.
function MatrixToDataFrame(mat)
    DF_mat = DataFrame(
        mat[2:end, 1:end],
        string.(mat[1, 1:end])
    )
    return DF_mat
end
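A short usage sketch of the same header-row idea, assuming DataFrames.jl 1.x. The helper name `matrix_to_df` and the sample data are hypothetical; the `mapcols` step is an optional extra that narrows the column types, since columns built from a Matrix{Any} usually end up with element type Any.

```julia
using DataFrames

# Same idea as the function above: the first row holds the column headers
matrix_to_df(mat) = DataFrame(mat[2:end, :], string.(mat[1, :]))

raw = ["a" "b" "c"; 1 2 3; 4 5 6]   # Matrix{Any}, as a spreadsheet import might produce
df = matrix_to_df(raw)

# identity.(col) lets Julia pick a concrete element type for each column
typed = mapcols(col -> identity.(col), df)
```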
So I found this online and honestly felt dumb.
using CSV, DataFrames # DataFrame is defined in DataFrames.jl, not CSV.jl
WhatIWant = DataFrame(WhatIHave)
This was adapted from an R guide, but it works, so heck.
DataFrame([1 2 3 4; 5 6 7 8; 9 10 11 12], :auto)
This works, as per the help entry for DataFrame (type ?DataFrame in the REPL).
How do I write a line of Julia code that sums the values of col2 for the rows where the value of col1 is in list? I'm pretty new to Julia, and the following line raises the error: Exception has occurred: DimensionMismatch DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 10 and 3
total_sum = sum(df[ismember(df[:, :col1], list), :col2])
One way could be:
julia> df = DataFrame(reshape(1:12,4,3),:auto)
4×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 5 9
2 │ 2 6 10
3 │ 3 7 11
4 │ 4 8 12
julia> list = [2,3]
2-element Vector{Int64}:
2
3
julia> sum(df.x2[df.x1 .∈ Ref(list)])
13
This uses broadcasting on in (Julia's equivalent of ismember), which can also be written as ∈. Ref(list) is used to prevent broadcasting over list.
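To see what Ref changes, here is a small sketch in plain Julia (no packages needed). The vectors `col1` and `col2` are stand-ins for the data frame columns, chosen for illustration.

```julia
col1 = [1, 2, 3, 4]
col2 = [5, 6, 7, 8]
list = [2, 3]

# Ref(list) makes `list` act as a single value in the broadcast,
# so each element of col1 is tested against the whole list
mask = col1 .∈ Ref(list)
total = sum(col2[mask])

# in(list) builds a one-argument membership test and is an equivalent spelling
mask2 = in(list).(col1)

# Without Ref, broadcasting tries to pair col1 and list element by element
# and fails because the lengths differ
err = try
    col1 .∈ list
    nothing
catch e
    e
end
```

The captured `err` is the same kind of DimensionMismatch reported in the question.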
Depending on what you want to do, filter! is also worth knowing (using code from Dan Getz's answer). Note that filter! removes the non-matching rows from df in place:
julia> sum(filter!(:x1 => x1 -> x1 ∈ [2,3], df).x2)
13
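If the original data frame should stay intact, the non-mutating filter can be used instead. A minimal sketch, assuming DataFrames.jl 1.x and the same example data as above:

```julia
using DataFrames

df = DataFrame(reshape(1:12, 4, 3), :auto)

# filter returns a new data frame and leaves df untouched
kept = filter(:x1 => in([2, 3]), df)
total = sum(kept.x2)
rows_before = nrow(df)

# filter! drops the non-matching rows from df itself
filter!(:x1 => in([2, 3]), df)
rows_after = nrow(df)
```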
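Not exactly sure if this is what you're asking, but try intersect. Note that the result of intersect is used as row indices below, which selects the right rows only because the values in column a happen to equal the row numbers: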
julia> using DataFrames
julia> df = DataFrame(a = 1:5, b = 2:6)
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 2 3
3 │ 3 4
4 │ 4 5
5 │ 5 6
julia> list = collect(3:10);
julia> sum(df.b[intersect(df.a, list)])
15
I'm new to Julia and I am trying to implement One-Vs-Rest Multi-Class Classification, and I was wondering if anyone could help me out. Here is a snippet of my code so far:
My data frame is basic since I'm trying to figure out the implementation first. My c column is my class, consisting of [0, 1, 2], and my y, x1, x2, x3 columns are random Int64 values.
using DataFrames
using CSV
using StatsBase
using StatsModels
using Statistics
using Plots, StatsPlots
using GLM
using Lathe
df = DataFrame(CSV.File("data.csv"))
fm = @formula(c~x1+x2+x3+y)
model0 = glm(fm0, df, Binomial(), ProbitLink()) # 0 vs [1,2]
model1 = glm(fm1, df, Binomial(), ProbitLink()) # 1 vs [0,2]
model2 = glm(fm2, df, Binomial(), ProbitLink()) # 2 vs [0,1]
I am trying to make logistic models but I don’t know how to do it.
If anyone can help me out, I would be thrilled.
I am trying to split the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
My only problem is that I don't know how to write the logistic model for a multi-class dataset.
Here is how you can do the same manually using GLM.jl (there is a lot of boilerplate code, but I wanted to keep the example simple):
df = DataFrame(x1=rand(100), x2=rand(100), x3=rand(100), target=rand([0, 1, 2], 100));
model0 = glm(@formula((target==0)~x1+x2+x3), df, Binomial(), ProbitLink())
model1 = glm(@formula((target==1)~x1+x2+x3), df, Binomial(), ProbitLink())
model2 = glm(@formula((target==2)~x1+x2+x3), df, Binomial(), ProbitLink())
choice = argmax.(eachrow([predict(model0) predict(model1) predict(model2)])) .- 1 # need to subtract 1 to use 0-based indexing
Let me explain the last operation step by step:
Get the predictions of the three models as columns of a matrix:
julia> [predict(model0) predict(model1) predict(model2)]
100×3 Matrix{Float64}:
0.517606 0.314234 0.206062
0.173916 0.431573 0.389071
0.211322 0.355592 0.413929
0.252108 0.337381 0.387629
0.515388 0.306834 0.211937
0.169052 0.386062 0.436603
0.125764 0.395105 0.490297
0.0955411 0.347634 0.589351
0.449734 0.341201 0.227459
⋮
0.412786 0.281343 0.303454
0.209337 0.354169 0.417261
0.37683 0.345307 0.273704
0.187584 0.411171 0.390831
0.401612 0.243119 0.350124
0.323155 0.338805 0.322453
0.488678 0.300927 0.23324
0.0979282 0.413296 0.522639
0.195902 0.313932 0.472582
Iterate rows of this matrix:
julia> eachrow([predict(model0) predict(model1) predict(model2)])
Base.Generator{Base.OneTo{Int64}, Base.var"#240#241"{Matrix{Float64}}}(Base.var"#240#241"{Matrix{Float64}}([0.5176063824396965 0.3142344514631397 0.2060615429588215; 0.17391563070921184 0.4315728844478078 0.3890711795309746; … ; 0.09792824142064335 0.41329629745776897 0.5226385962610233; 0.19590183503978997 0.31393218775269705 0.4725817014561341]), Base.OneTo(100))
For each row get index of maximum value:
julia> argmax.(eachrow([predict(model0) predict(model1) predict(model2)]))
100-element Vector{Int64}:
1
2
3
3
1
3
3
3
1
⋮
1
3
1
2
1
2
1
3
3
Subtract 1 from the result as Julia uses 1-based indexing, and you wanted the first model to have number 0:
julia> argmax.(eachrow([predict(model0) predict(model1) predict(model2)])) .- 1
100-element Vector{Int64}:
0
1
2
2
0
2
2
2
0
⋮
0
2
0
1
0
1
0
2
2
Alternatively you could write:
julia> map(predict(model0), predict(model1), predict(model2)) do x...
           return argmax(x) - 1
       end
100-element Vector{Int64}:
0
1
2
2
0
2
2
2
0
⋮
0
2
0
1
0
1
0
2
2
This is more efficient and shorter, but I was not sure whether it is clearer, as it uses slurping.
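The row-wise argmax step can also be tried on its own, without fitting any models. In this sketch the `scores` matrix is made-up data standing in for the three prediction columns, and indexing into a `classes` vector is one possible way to avoid the manual `.- 1` shift.

```julia
# Hypothetical scores: one row per observation, one column per model
scores = [0.52 0.31 0.21;
          0.17 0.43 0.39;
          0.21 0.36 0.41]

classes = [0, 1, 2]   # class label associated with each column

# Index of the largest score in each row, mapped to its class label
choice = classes[argmax.(eachrow(scores))]

# The winning score itself, e.g. to inspect how confident the chosen model was
confidence = maximum.(eachrow(scores))
```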
An example how to train one model for three classes using Flux.jl (still using the same df source data frame):
using Flux
model = Chain(Dense(3 => 3, σ), softmax)
X = permutedims(Matrix(df[:, 1:3]))
y = Flux.onehotbatch(df.target, 0:2)
optim = Flux.setup(Flux.Adam(0.01), model)
for epoch in 1:1_000
    Flux.train!(model, [(X, y)], optim) do m, x, y
        y_hat = m(x)
        Flux.crossentropy(y_hat, y)
    end
end
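To turn the model output back into class labels, Flux.onecold can be used; it is the inverse of onehotbatch. The sketch below applies it to a hand-written probability matrix (hypothetical values, one column per observation) rather than to a trained model, so it runs on its own.

```julia
using Flux

# Hypothetical class probabilities: rows are classes 0, 1, 2; columns are observations
probs = [0.1 0.7;
         0.8 0.2;
         0.1 0.1]

# For each column, pick the label whose row has the highest probability
labels = Flux.onecold(probs, 0:2)

# Round trip: one-hot encode some labels and decode them again
roundtrip = Flux.onecold(Flux.onehotbatch([0, 2, 1], 0:2), 0:2)
```

With the trained model from above, the analogous call would presumably be `Flux.onecold(model(X), 0:2)`.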
Personally, I choose the Julia implementation for it. So Bogumił Kamiński's answer would be superior to mine.
I don't know if any packages provide a multi-target/label Logistic Regression model implemented fully in Julia (I would like to know if there are any, I'll prepend them to this answer). But, you can apply the model using ScikitLearn.jl which is a wrapper for Scikit-learn in Python and uses a connected python session to run the code. You can get further information in their repository. But, I created synthetic data as similar as I could to what you have to train the model on it and show you how you can do it:
#- Packages
using DataFrames
using ScikitLearn
@sk_import linear_model: LogisticRegression
#- Synthetic data
df = DataFrame(
    x1=rand(100),
    x2=rand(100),
    x3=rand(100),
    target=rand([0, 1, 2], 100)
)
# 100×4 DataFrame
# Row │ x1 x2 x3 target
# │ Float64 Float64 Float64 Int64
# ─────┼──────────────────────────────────────────
# 1 │ 0.607024 0.6818 0.562058 0
# 2 │ 0.235538 0.974469 0.553292 1
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 99 │ 0.382491 0.224192 0.122515 1
# 100 │ 0.617425 0.793276 0.228549 0
#- Split data
train, test = df[1:80, :], df[81:end, :]
Then I train the LogisticRegression model (which under the hood runs the same object from Python's scikit-learn):
#- Train model
model = LogisticRegression(multi_class="ovr")
fit!(model, Matrix(train[:, 1:3]), train[:, 4])
And the last phase would be the prediction:
#- Predict
preds = predict(model, Matrix(test[:, 1:3]));
#- Count right predictions
sum(preds .== test[:, 4])
# returns `6` in my case
Note that you need to install PyCall.jl to use ScikitLearn.jl. Make sure to follow the instructions provided by the PyCall.jl to set up the required environment first.
I'm a beginner in Spark and I'm dealing with a large dataset (over 1.5 million rows and 2 columns). I have to evaluate the cosine similarity of the field "features" between each pair of rows. The main problem is iterating over the rows and finding an efficient and fast method. I will have to use this method with another dataset of 42.5 million rows, and it will be a big computational problem if I don't find the most efficient way of doing it.
| post_id | features |
| -------- | -------- |
| Bkeur23 |[cat,dog,person] |
| Ksur312kd |[wine,snow,police] |
| BkGrtTeu3 |[] |
| Fwd2kd |[person,snow,cat] |
I've created an algorithm that evaluates this cosine similarity between each element of the i-th and j-th rows, and I've tried using lists or creating a Spark DF / RDD for each result and merging them using the "union" function.
The function I've used to evaluate the cosine similarity is the following. It takes two lists as input (the lists of the i-th and j-th rows) and returns the maximum value of the cosine similarity over every pair of elements from the two lists. But this is not the problem.
def cosineSim(lista1,lista2,embed):
    #embed = hub.KerasLayer(os.getcwd())
    eps=sys.float_info.epsilon
    if((lista1 is not None) and (lista2 is not None)):
        if((len(lista1)>0) and (len(lista2)>0)):
            risultati={}
            for a in lista1:
                tem = a
                x = tf.constant([tem])
                embeddings = embed(x)
                x = np.asarray(embeddings)
                x1 = x[0].tolist()
                for b in lista2:
                    tem = b
                    x = tf.constant([tem])
                    embeddings = embed(x)
                    x = np.asarray(embeddings)
                    x2 = x[0].tolist()
                    sum = 0
                    suma1 = 0
                    sumb1 = 0
                    for i,j in zip(x1, x2):
                        suma1 += i * i
                        sumb1 += j*j
                        sum += i*j
                    cosine_sim = sum / ((sqrt(suma1))*(sqrt(sumb1))+eps)
                    risultati[a+'-'+b]=cosine_sim
                    cosine_sim=0
            risultati=max(risultati.values())
            return risultati
The function I'm using to iterate over the rows is the following one:
def iterazione(df,numero,embed):
    a=1
    k=1
    emp_RDD = spark.sparkContext.emptyRDD()
    columns1= StructType([StructField('Source', StringType(), False),
                          StructField('Destination', StringType(), False),
                          StructField('CosinSim',FloatType(),False)])
    first_df = spark.createDataFrame(data=emp_RDD,
                                     schema=columns1)
    for i in df:
        for j in islice(df, a, None):
            r=cosineSim(i[1],j[1],embed)
            if(r>0.45):
                z=spark.createDataFrame(data=[(i[0],j[0],r)],schema=columns1)
                first_df=first_df.union(z)
            k=k+1
            if(k==numero):
                k=a+1
                a=a+1
    return first_df
The output I desire is something like this:
| Source | Dest | CosinSim |
| -------- | ---- | ------ |
| Bkeur23 | Ksur312kd | 0.93 |
| Bkeur23 | Fwd2kd | 0.673 |
| Ksur312kd | Fwd2kd | 0.76 |
But there is a problem in my "iterazione" function.
Please help me find the best way to iterate over all these rows. I was also thinking about copying the column "features" as "features2" and applying my function using withColumn, but I don't know how to do it or whether it will work. I want to know if there's a way to do it directly in a Spark dataframe, avoiding the creation of other datasets and merging them later, or if you know a faster and more efficient method. Thank you!
I have two arrays with same dimension:
a1 = [1,1,3,4,6,6]
a2 = [1,2,3,4,5,6]
I want to group both of them by the values of array a1 and get the mean of array a2 for each group.
The expected output, computed from array a2, is shown below:
result:
1.5
3.0
4.0
5.5
Please suggest an approach to achieve this task.
Thanks!!
Here is a solution using DataFrames.jl:
julia> using DataFrames, Statistics
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> combine(groupby(df, :a1), :a2 => mean)
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
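For small inputs, the same grouping can be sketched without DataFrames.jl, using only the Statistics standard library. This is an illustrative alternative, not a replacement for groupby: it rescans a1 once per distinct key, so it is only reasonable when the number of groups is small.

```julia
using Statistics

a1 = [1, 1, 3, 4, 6, 6]
a2 = [1, 2, 3, 4, 5, 6]

# Distinct group keys, in sorted order
group_keys = sort(unique(a1))

# For each key, average the a2 entries whose a1 entry matches it
group_means = [mean(a2[a1 .== k]) for k in group_keys]
```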
EDIT:
Here are the timings (as usual in Julia you need to remember that the first time you run some function it has to be compiled which takes time):
julia> using DataFrames, Statistics
(@v1.6) pkg> st DataFrames # I am using main branch, as it should be released this week
Status `D:\.julia\environments\v1.6\Project.toml`
[a93c6f00] DataFrames v0.22.7 `https://github.com/JuliaData/DataFrames.jl.git#main`
julia> df = DataFrame(a1=rand(1:1000, 10^8), a2=rand(10^8)); # 10^8 rows in 1000 random groups
julia> @time combine(groupby(df, :a1), :a2 => mean); # first run includes compilation time
3.781717 seconds (6.76 M allocations: 1.151 GiB, 6.73% gc time, 84.20% compilation time)
julia> @time combine(groupby(df, :a1), :a2 => mean); # second run is just execution time
0.442082 seconds (294 allocations: 762.990 MiB)
Note that e.g. data.table (if this is your reference) on similar data is noticeably slower:
> library(data.table) # using 4 threads
> df = data.table(a1 = sample(1:1000, 10^8, replace=T), a2 = runif(10^8));
> system.time(df[, .(mean(a2)), by = a1])
user system elapsed
4.72 1.20 2.00
In case you are interested in using Chain.jl in addition to DataFrames.jl, Bogumił Kamiński's answer might then look like this:
julia> using DataFrames, Statistics, Chain
julia> df = DataFrame(a1 = [1,1,3,4,6,6], a2 = [1,2,3,4,5,6]);
julia> @chain df begin
           groupby(:a1)
           combine(:a2 => mean)
       end
4×2 DataFrame
Row │ a1 a2_mean
│ Int64 Float64
─────┼────────────────
1 │ 1 1.5
2 │ 3 3.0
3 │ 4 4.0
4 │ 6 5.5
I know that it is possible to convert a Float64 into an Int64
using the convert function.
Unfortunately, it doesn't work when applying convert to a 2-D array.
julia> convert(Int64, 2.0)
2
julia> A = [1.0 2.0; 3.0 4.0]
2x2 Array{Float64,2}:
1.0 2.0
3.0 4.0
julia> convert(Int64, A)
ERROR: `convert` has no method matching convert(::Type{Int64}, ::Array{Float64,2})
in convert at base.jl:13
How do I convert a 2-D array of floats into a 2-D array of ints?
What I tried
I could do it using the following code,
which is a little verbose but it works.
I am hoping there is an easier way to do it though.
julia> A = [1.0 2.0; 3.0 4.0]
2x2 Array{Float64,2}:
1.0 2.0
3.0 4.0
julia> B = Array(Int64, 2, 2)
2x2 Array{Int64,2}:
4596199964293150115 4592706631984861405
4604419156384151675 0
julia> for i = 1:2
           for j = 1:2
               B[i,j] = convert(Int64,A[i,j])
           end
       end
julia> B
2x2 Array{Int64,2}:
1 2
3 4
An answer that doesn't work for me
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: http://docs.julialang.org
_ _ _| |_ __ _ | Type "help()" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.3.10 (2015-06-24 13:54 UTC)
_/ |\__'_|_|_|\__'_| | Official http://julialang.org release
|__/ | x86_64-linux-gnu
julia> A = [1.2 3.4; 5.6 7.8]
2x2 Array{Float64,2}:
1.2 3.4
5.6 7.8
julia> round(Int64, A)
ERROR: `round` has no method matching round(::Type{Int64}, ::Array{Float64,2})
You can convert a 2x2 array of floats into a 2x2 array of ints very easily, after you decide how you want rounding to be handled:
julia> A = [1.0 -0.3; 3.9 4.5]
2x2 Array{Float64,2}:
1.0 -0.3
3.9 4.5
julia> round.(Int, A)
2x2 Array{Int64,2}:
1 0
4 4
julia> floor.(Int, A)
2x2 Array{Int64,2}:
1 -1
3 4
julia> trunc.(Int, A)
2x2 Array{Int64,2}:
1 0
3 4
julia> ceil.(Int, A)
2x2 Array{Int64,2}:
1 0
4 5
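When every entry is already a whole number, as in the question's matrix, no rounding decision is needed. A minimal sketch in plain Julia (1.x assumed): an exact conversion works directly, and it throws an InexactError as soon as a value has a fractional part.

```julia
A = [1.0 2.0; 3.0 4.0]

# Exact element-wise conversion; valid because every entry is integer-valued
B = Int.(A)

# Converting the whole array at once gives the same result
C = convert(Matrix{Int}, A)

# With fractional values the exact conversion refuses to guess a rounding mode
D = [1.2 3.4]
err = try
    Int.(D)
    nothing
catch e
    e
end
```

In that case, one of the rounding functions shown above makes the intended behavior explicit.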
You can use map, which preserves the matrix dimensions, and does not depend on vectorized methods:
julia> x = rand(2,2)
2x2 Array{Float64,2}:
0.279777 0.610333
0.277234 0.947914
julia> map(y->round(Int,y), x)
2x2 Array{Int64,2}:
0 1
0 1
This answer is for Julia v0.3. For newer versions, where the int function is no longer available, see DSM's answer.
Use the int function:
julia> a = rand(2,2)
2x2 Array{Float64,2}:
0.145651 0.362497
0.879268 0.753001
julia> int(a)
2x2 Array{Int64,2}:
0 0
1 1