DataFrames to database tables

Hello geniuses. I am a newbie in Julia, but I have an ambitious goal.
I am trying to build the following pipeline; of course it should be an automatic process:
1. Read data from a CSV file into a DataFrame.
2. Check the data, then create DB tables according to the DataFrame column types.
3. Insert the data from the DataFrame into the created table (e.g. SQLite).
I am stuck at step 2 because, for example, a column's type is Vector{String15}, and I am struggling with how to reflect that type in the query that creates the table.
I mean, I could not find any solution for (a) or (b) below.
using CSV, DataFrames

fname = string(@__DIR__, "/", "testdata/test.csv")
df = CSV.read(fname, DataFrame)
last = ncol(df)
col = Vector{DataType}(undef, last)
for i = 1:last
    col[i] = typeof(df[!, i])   # e.g. Vector{String15}
    if String == col[i]         # (a) does not work
        # create table sql; expected result:
        query = "create table testtable( col1 varchar(15),...."
    elseif Int == col[i]        # (b) does not work
        # create table sql; expected result:
        query = "create table testtable( col1 int,...."
    end
    # ...
end
I am wondering:
Do I really have to derive the table column type from Vector{String15} by hand?
Does DataFrames have a utility method to do it?
Should I combine it with another module to do it?
I am hoping for some smart tips. Thanks in advance.

Here is how you can do it both ways (from a DataFrame to a database table, and back):
julia> using DataFrames
julia> using CSV
julia> df = CSV.read("test.csv", DataFrame)
3×3 DataFrame
 Row │ a            b      c
     │ String15     Int64  Float64
─────┼─────────────────────────────
   1 │ a1234567890      1      1.5
   2 │ b1234567890     11     11.5
   3 │ b1234567890    111    111.5
julia> using SQLite
julia> db = SQLite.DB("test.db")
SQLite.DB("test.db")
julia> SQLite.load!(df, db, "df")
"df"
julia> SQLite.columns(db, "df")
(cid = [0, 1, 2], name = ["a", " b", " c"], type = ["TEXT", "INT", "REAL"], notnull = [1, 1, 1], dflt_value = [missing, missing, missing], pk = [0, 0, 0])
julia> query = DBInterface.execute(db, "SELECT * FROM df")
SQLite.Query(SQLite.Stmt(SQLite.DB("test.db"), 4), Base.RefValue{Int32}(100), [:a, Symbol(" b"), Symbol(" c")], Type[Union{Missing, String}, Union{Missing, Int64}, Union{Missing, Float64}], Dict(:a => 1, Symbol(" c") => 3, Symbol(" b") => 2), Base.RefValue{Int64}(0))
julia> DataFrame(query)
3×3 DataFrame
 Row │ a            b      c
     │ String       Int64  Float64
─────┼─────────────────────────────
   1 │ a1234567890      1      1.5
   2 │ b1234567890     11     11.5
   3 │ b1234567890    111    111.5
If you need more explanation, this is covered in chapter 8 of Julia for Data Analysis. This chapter should be available on MEAP in 1-2 weeks (and the source code is already available at https://github.com/bkamins/JuliaForDataAnalysis)
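If you still want to build the CREATE TABLE statement yourself, as in the loop in the question, the key is to look at the element type of each column with eltype (e.g. String15, Int64) rather than the container type from typeof (e.g. Vector{String15}), and to compare with <: rather than ==. Below is a minimal sketch; the table name, the sqltype helper, and the type-to-SQL mapping are my own illustration under those assumptions, not a DataFrames or SQLite API:
using CSV, DataFrames

df = CSV.read("test.csv", DataFrame)

# map a column's element type to an SQL type name (hypothetical mapping)
function sqltype(T::Type)
    T = nonmissingtype(T)              # strip Union{Missing, ...} if present
    T <: AbstractString && return "TEXT"
    T <: Integer        && return "INTEGER"
    T <: AbstractFloat  && return "REAL"
    return "BLOB"                      # fallback for anything else
end

coldefs = [string(name, " ", sqltype(eltype(df[!, name]))) for name in names(df)]
query = "CREATE TABLE testtable (" * join(coldefs, ", ") * ")"
You could then pass query to DBInterface.execute(db, query) and insert the rows afterwards, but as shown above SQLite.load! already does all of this for you.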

Related

How to convert PySpark dataframe columns into list of dictionary based on groupBy column

I'm converting DataFrame columns into a list of dictionaries.
Input dataframe has 3 columns:
ID  accounts  pdct_code
1   100       IN
1   200       CC
2   300       DD
2   400       ZZ
3   500       AA
I need to read this input dataframe and convert it into 3 output rows. The output should look like this:
ID  arrayDict
1   [{"accounts": 100, "pdct_cd": "IN"}, {"accounts": 200, "pdct_cd": "CC"}]
Similarly, for ID 2 there should be one row with two dictionaries of key-value pairs.
I tried this:
Df1 = df.groupBy("ID").agg(collect_list(struct(col("accounts"), ("pdct_cd"))).alias("array_dict"))
But the output is not quite what I wanted, which should be a list of dictionaries.
What you described (a list of dictionaries) doesn't exist in Spark. Instead of lists we have arrays, and instead of dictionaries we have structs or maps. Since you didn't use these terms, this will be a loose interpretation of what I think you need.
The following will create arrays of strings. Those strings will have the structure which you probably want.
df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code")))
struct() puts your column inside a struct data type.
to_json() creates a JSON string out of the provided struct.
collect_list() is an aggregation function which moves all the strings of the group into an array.
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 100, "IN"),
     (1, 200, "CC"),
     (2, 300, "DD"),
     (2, 400, "ZZ"),
     (3, 500, "AA")],
    ["ID", "accounts", "pdct_code"])
df = df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code"))).alias("array_dict"))
df.show(truncate=0)
# +---+----------------------------------------------------------------------+
# |ID |array_dict |
# +---+----------------------------------------------------------------------+
# |1 |[{"accounts":100,"pdct_code":"IN"}, {"accounts":200,"pdct_code":"CC"}]|
# |3 |[{"accounts":500,"pdct_code":"AA"}] |
# |2 |[{"accounts":300,"pdct_code":"DD"}, {"accounts":400,"pdct_code":"ZZ"}]|
# +---+----------------------------------------------------------------------+

Julia - How to convert a DataFrame to an Array?

I have a DataFrame containing only numerical values. Now, what I'd like to do is extract all the values of this DataFrame as an Array. How can I do this? I know that for a single column, if I do df[!,:x1], then the output is an array. But how to do this for all the columns?
The shortest form seems to be:
julia> Matrix(df)
3×2 Array{Float64,2}:
 0.723835  0.307092
 0.02993   0.0147598
 0.141979  0.0271646
In some scenarios you might need to specify the element type, such as Matrix{Union{Missing, Float64}}(df).
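For example, a minimal sketch (with hypothetical data) where one column contains missing values:
using DataFrames

df = DataFrame(x=[1.0, missing, 3.0], y=[0.5, 1.5, 2.5])

Matrix(df)                           # element type Union{Missing, Float64} is inferred
Matrix{Union{Missing, Float64}}(df)  # or request the element type explicitly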
Try this:
convert(Matrix, df[:,:])
You can also use the Tables API for this, in particular the Tables.matrix function:
julia> df = DataFrame(x=rand(3), y=rand(3))
3×2 DataFrame
 Row │ x          y
     │ Float64    Float64
─────┼──────────────────────
   1 │ 0.33002    0.180934
   2 │ 0.834302   0.470976
   3 │ 0.0916842  0.45172
julia> Tables.matrix(df)
3×2 Array{Float64,2}:
 0.33002    0.180934
 0.834302   0.470976
 0.0916842  0.45172

Reading CSV files in a loop into DataFrames (Julia)

I want to read multiple CSV files with changing names like "CSV_1.csv" and so on.
My idea was to simply implement a loop like the following
using CSV
for i = 1:8
a[i] = CSV.read("0.$i.csv")
end
but obviously that won't work.
Is there a simple way of implementing this, like introducing an additional dimension in the dataframe?
Assuming a in this case is an array, this is definitely possible, but to do it this way, you'd need to pre-allocate your array, since you can't assign an index that doesn't exist yet:
julia> a = []
0-element Array{Any,1}
julia> a[1] = 1
ERROR: BoundsError: attempt to access 0-element Array{Any,1} at index [1]
Stacktrace:
[1] setindex!(::Array{Any,1}, ::Any, ::Int64) at ./essentials.jl:455
[2] top-level scope at REPL[10]:1
julia> a2 = Vector{Int}(undef, 5);
julia> for i in 1:5
a2[i] = i
end
julia> a2
5-element Array{Int64,1}:
1
2
3
4
5
Alternatively, you can use push!() to add things to an array as you need.
julia> a3 = [];
julia> for i in 1:5
push!(a3, i)
end
julia> a3
5-element Array{Any,1}:
1
2
3
4
5
So for your CSV files:
using CSV, DataFrames
a = []
for i = 1:8
    push!(a, CSV.read("0.$i.csv", DataFrame))
end
Alternatively to what Kevin proposed, you can write:
# read in the files into a vector
a = CSV.read.(["0.$i.csv" for i in 1:8], DataFrame)
# add an indicator column
for i in 1:8
    a[i][!, :id] .= i
end
# create a single data frame with an indicator column holding the source
b = reduce(vcat, a)
You can read an arbitrary number of CSV files with a certain pattern in the file name, create a dataframe per file and lastly, if you want, create a single dataframe.
using CSV, Glob, DataFrames
path = raw"C:\..." # directory of your files (raw is useful in Windows to add a \)
files=glob("*.csv", path) # to load all CSVs from a folder (* means arbitrary pattern)
dfs = DataFrame.(CSV.File.(files)) # creates a list of dataframes
# add an index column to be able to later discern the different sources
for i in 1:length(dfs)
    dfs[i][!, :sample] .= i # I called the new col sample
end
# finally, reduce your collection of dfs via vertical concatenation
df = reduce(vcat, dfs)
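As a side note, recent versions of DataFrames.jl also let you add the indicator column during the concatenation itself via the source keyword of reduce(vcat, ...), so the explicit loop is not needed. A minimal sketch, reusing the hypothetical file names from above and assuming a DataFrames.jl version that supports source:
using CSV, DataFrames

files = ["0.$i.csv" for i in 1:8]
dfs = [CSV.read(f, DataFrame) for f in files]

# `source` adds a column recording which input file each row came from
df = reduce(vcat, dfs; source = :sample => files)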

Multiple arrays to a pandas DataFrame

So, I am iterating through a dictionary and taking a bunch of values out as an array, trying to make a DataFrame with each observation as a separate row.
import numpy as np
import pandas as pd
from datetime import datetime

X1 = []
for k, v in DF_grp:
    date = v['Date'].astype(datetime)
    usage = v['Usage'].astype(float)
    comm = v['comm'].astype(float)
    mdf = pd.DataFrame({'Id': k[0], 'date': date, 'usage': usage, 'comm': comm})
    mdf['used_ratio'] = (mdf['usage'] / mdf['comm']).round(2) * 100
    ts = pd.Series(mdf['usage'].values, index=mdf['date']).sort_index(ascending=True)
    ts2 = pd.Series(mdf['used_ratio'].values, index=mdf['date']).sort_index(ascending=True)
    ts2 = ts2.dropna()
    data = ts2.values.copy()
    if len(data) == 10:
        X1 = np.append(X1, data, axis=0)
print(X1)
[0,0,0,0,1,0,0,0,1]
[1,2,3,4,5,6,7,8,9]
[0,5,6,7,8,9,1,2,3]
....
and so on. So the question is: how do I capture all these arrays in a single DataFrame so that it looks like below:
[[0,0,0,0,1,0,0,0,1]] --- #row 1 in dataframe
[[1,2,3,4,5,6,7,8,9]] --- #row 2 in dataframe
Can the same task be divided up further?
There are more than 500K arrays in the dataset.
Thank You
I hope the code below helps you:
arr2 = [0,0,0,0,1,0,0,0,1]
arr3 = [1,2,3,4,5,6,7,8,9]
arr4 = [0,5,6,7,8,9,1,2,3]
li = [arr2, arr3, arr4]
pd.DataFrame(data = li, columns= ["c1", "c2", "c3", "c4", "c5","c6", "c7", "c8", "c9"])
You can make it more dynamic by creating one temporary array at a time and appending it to a list, then creating the data frame from the generated list of arrays. Also, you can add names to the columns (shown above) or avoid naming them (just remove the column detail). I hope that solves your problem.
Declare an empty DataFrame on the second line, i.e. below X1 = [], with df = pd.DataFrame(). Next, inside your if statement, run the following after appending values to X1:
df = pd.concat([df, pd.Series(X1)]).T
Or,
df = pd.DataFrame(np.NaN, index=range(3), columns=range(9))
for i in range(3):
    df.iloc[i, :] = np.random.randint(9)  # <----- Pass X1 here
df
# 0 1 2 3 4 5 6 7 8
# 0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
# 1 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
# 2 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0

Create array of literals and columns from List of Strings in Spark

I am trying to define functions in Scala that take a list of strings as input, and convert them into the columns passed to the DataFrame array arguments used in the code below.
val df = sc.parallelize(Array((1,1),(2,2),(3,3))).toDF("foo","bar")
val df2 = df
.withColumn("columnArray",array(df("foo").cast("String"),df("bar").cast("String")))
.withColumn("litArray",array(lit("foo"),lit("bar")))
More specifically, I would like to create functions colFunction and litFunction (or just one function if possible) that takes a list of strings as an input parameter and can be used as follows:
val df = sc.parallelize(Array((1,1),(2,2),(3,3))).toDF("foo","bar")
val colString = List("foo","bar")
val df2 = df
.withColumn("columnArray",array(colFunction(colString))
.withColumn("litArray",array(litFunction(colString)))
I have tried mapping the colString to an Array of columns with all the transformations but this doesn't work.
Spark 2.2+:
Support for Seq, Map and Tuple (struct) literals has been added in SPARK-19254. According to tests:
import org.apache.spark.sql.functions.typedLit
typedLit(Seq("foo", "bar"))
Spark < 2.2
Just map with lit and wrap with array:
def asLitArray[T](xs: Seq[T]) = array(xs map lit: _*)
df.withColumn("an_array", asLitArray(colString)).show
// +---+---+----------+
// |foo|bar| an_array|
// +---+---+----------+
// | 1| 1|[foo, bar]|
// | 2| 2|[foo, bar]|
// | 3| 3|[foo, bar]|
// +---+---+----------+
Regarding the transformation from Seq[String] to a Column of type Array, this functionality is already provided by:
def array(colName: String, colNames: String*): Column
or
def array(cols: Column*): Column
Example:
val cols = Seq("bar", "foo")
cols match { case x :: xs => df.select(array(x, xs: _*)) }
// or
df.select(array(cols map col: _*))
Of course all columns have to be of the same type.
To create df containing array type column (3 alternatives):
val df = Seq(
  (Seq("foo", "bar")),
  (Seq("baz", "qux"))
).toDF("col_name")

val df = Seq(
  (Array("foo", "bar")),
  (Array("baz", "qux"))
).toDF("col_name")

val df = Seq(
  (List("foo", "bar")),
  (List("baz", "qux"))
).toDF("col_name")
To add a column of array type:
providing existing column names:
df.withColumn("new_col", array("col1", "col2"))
providing a list of existing column names:
df.withColumn("new_col", array(list_of_str map col: _*))
providing literal values (2 alternatives):
df.withColumn("new_col", typedLit(Seq("foo", "bar")))
df.withColumn("new_col", array(lit("foo"), lit("bar")))
providing a list of literal values (2 alternatives):
df.withColumn("new_col", typedLit(list_of_str))
df.withColumn("new_col", array(list_of_str map lit: _*))
