multiple array to dataframe pandas - arrays

So, I am iterating through a dictionary and taking a bunch of values out as an array, trying to make a DataFrame with each observation as a separate row.
X1 = []
for k, v in DF_grp:
    date = v['Date'].astype(datetime)
    usage = v['Usage'].astype(float)
    comm = v['comm'].astype(float)
    mdf = pd.DataFrame({'Id': k[0], 'date': date, 'usage': usage, 'comm': comm})
    mdf['used_ratio'] = ((mdf['usage'] / mdf['comm']).round(2)) * 100
    ts = pd.Series(mdf['usage'].values, index=mdf['date']).sort_index(ascending=True)
    ts2 = pd.Series(mdf['used_ratio'].values, index=mdf['date']).sort_index(ascending=True)
    ts2 = ts2.dropna()
    data = ts2.values.copy()
    if len(data) == 10:
        X1 = np.append(X1, data, axis=0)
        print(X1)
[0,0,0,0,1,0,0,0,1]
[1,2,3,4,5,6,7,8,9]
[0,5,6,7,8,9,1,2,3]
....
and so on. So the question is: how do I capture all these arrays in a single DataFrame so that it looks like the below:
[[0,0,0,0,1,0,0,0,1]] --- #row 1 in dataframe
[[1,2,3,4,5,6,7,8,9]] --- #row 2 in dataframe
Can the same task be split up further?
There are more than 500K arrays in the dataset.
Thank You

I hope the code below helps you:
arr2 = [0,0,0,0,1,0,0,0,1]
arr3 = [1,2,3,4,5,6,7,8,9]
arr4 = [0,5,6,7,8,9,1,2,3]
li = [arr2, arr3, arr4]
pd.DataFrame(data = li, columns= ["c1", "c2", "c3", "c4", "c5","c6", "c7", "c8", "c9"])
You can make it more dynamic by creating one temp_arr, appending that array to a list, and then creating the data frame from the generated list of arrays. You can also name the columns (as shown above) or skip naming them (just remove the columns argument). I hope that solves your problem.
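A minimal, self-contained sketch of that list-of-arrays pattern (the loop and values here are made up for illustration; substitute your real per-group computation):

```python
import numpy as np
import pandas as pd

rows = []                        # collect one array per observation
for i in range(3):               # stand-in for the real grouped iteration
    temp_arr = np.arange(9) + i  # hypothetical 9-element observation
    rows.append(temp_arr)

# one DataFrame, one row per collected array
df = pd.DataFrame(rows, columns=[f"c{j}" for j in range(1, 10)])
print(df.shape)  # (3, 9)
```

Each list element becomes one row, which matches the "each observation as a separate row" goal.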

Declare an empty dataframe on the second line, i.e. below X1 = [], with df = pd.DataFrame(). Next, inside your if statement, add the following after appending values to X1:
df = pd.concat([df, pd.Series(X1)]).T
Or,
df = pd.DataFrame(np.nan, index=range(3), columns=range(9))
for i in range(3):
    df.iloc[i, :] = np.random.randint(9)  # <----- pass X1 here
df
# 0 1 2 3 4 5 6 7 8
# 0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
# 1 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
# 2 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
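With more than 500K arrays, growing X1 via np.append on every iteration copies the whole buffer each time. A common alternative (a sketch only; the loop body is a stand-in for your real per-group computation) is to collect the rows in a plain list and stack once at the end:

```python
import numpy as np
import pandas as pd

collected = []
for i in range(5):                     # stand-in for iterating your groups
    data = np.full(9, i, dtype=float)  # hypothetical 9-element result
    if len(data) == 9:                 # keep only complete observations
        collected.append(data)

# single allocation: stack once, wrap once
X = np.vstack(collected)
df = pd.DataFrame(X)
print(df.shape)  # (5, 9)
```

Appending to a Python list is amortized O(1), so the total cost stays linear in the number of arrays instead of quadratic.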


DataFrames to Database tables

Hi geniuses. I am a newbie in Julia, but I have an ambitious goal.
I am trying to implement the following pipeline; of course it should be an automatic process:
1. Read data from a CSV file into a DataFrame
2. Check the data, then create DB tables according to the DataFrame column types
3. Insert data from the DataFrame into the created table (e.g. SQLite)
I am stuck at step 2 now because, for example, a column's type is Vector{String15},
and I am struggling with how to reflect that type in the CREATE TABLE query.
I mean, I could not find any solution for (a) or (b) below.
fname = string(@__DIR__, "/", "testdata/test.csv")
df = CSV.read(fname, DataFrame)
last = ncol(df)
for i = 1:last
    col[i] = typeof(df[!, i])    # ex. Vector{String15}
    if String == col[i]          # (a) does not work
        # create table sql; expect:
        query = "create table testtable( col1 varchar(15),...."
    elseif Int == col[i]         # (b) does not work
        # create table sql; expect:
        query = "create table testtable( col1 int,...."
    end
    # ...
end
I am wondering:
Do I really have to derive the table column type from Vector{String15} somehow?
Does DataFrames have a utility method to do it?
Should I combine it with another module to do it?
I am hoping for smart tips from you; thanks in advance.
Here is how you can do it both ways:
julia> using DataFrames
julia> using CSV
julia> df = CSV.read("test.csv", DataFrame)
3×3 DataFrame
Row │ a b c
│ String15 Int64 Float64
─────┼─────────────────────────────
1 │ a1234567890 1 1.5
2 │ b1234567890 11 11.5
3 │ b1234567890 111 111.5
julia> using SQLite
julia> db = SQLite.DB("test.db")
SQLite.DB("test.db")
julia> SQLite.load!(df, db, "df")
"df"
julia> SQLite.columns(db, "df")
(cid = [0, 1, 2], name = ["a", " b", " c"], type = ["TEXT", "INT", "REAL"], notnull = [1, 1, 1], dflt_value = [missing, missing, missing], pk = [0, 0, 0])
julia> query = DBInterface.execute(db, "SELECT * FROM df")
SQLite.Query(SQLite.Stmt(SQLite.DB("test.db"), 4), Base.RefValue{Int32}(100), [:a, Symbol(" b"), Symbol(" c")], Type[Union{Missing, String}, Union{Missing, Int64}, Union{Missing, Float64}], Dict(:a => 1, Symbol(" c") => 3, Symbol(" b") => 2), Base.RefValue{Int64}(0))
julia> DataFrame(query)
3×3 DataFrame
Row │ a b c
│ String Int64 Float64
─────┼─────────────────────────────
1 │ a1234567890 1 1.5
2 │ b1234567890 11 11.5
3 │ b1234567890 111 111.5
If you need more explanation, this is covered in chapter 8 of Julia for Data Analysis. This chapter should be available on MEAP in 1-2 weeks (and the source code is already available at https://github.com/bkamins/JuliaForDataAnalysis)

Reading CSV file in loop Dataframe (Julia)

I want to read multiple CSV files with changing names like "CSV_1.csv" and so on.
My idea was to simply implement a loop like the following
using CSV
for i = 1:8
    a[i] = CSV.read("0.$i.csv")
end
but obviously that won't work.
Is there a simple way of implementing this, like introducing an additional dimension in the dataframe?
Assuming a in this case is an array, this is definitely possible, but to do it this way, you'd need to pre-allocate your array, since you can't assign an index that doesn't exist yet:
julia> a = []
0-element Array{Any,1}
julia> a[1] = 1
ERROR: BoundsError: attempt to access 0-element Array{Any,1} at index [1]
Stacktrace:
[1] setindex!(::Array{Any,1}, ::Any, ::Int64) at ./essentials.jl:455
[2] top-level scope at REPL[10]:1
julia> a2 = Vector{Int}(undef, 5);
julia> for i in 1:5
           a2[i] = i
       end
julia> a2
5-element Array{Int64,1}:
1
2
3
4
5
Alternatively, you can use push!() to add things to an array as you need.
julia> a3 = [];
julia> for i in 1:5
           push!(a3, i)
       end
julia> a3
5-element Array{Any,1}:
1
2
3
4
5
So for your CSV files,
using CSV
a = []
for i = 1:8
    push!(a, CSV.read("0.$i.csv"))
end
Alternatively to what Kevin proposed, you can write:
# read in the files into a vector
a = CSV.read.(["0.$i.csv" for i in 1:8])
# add an indicator column
for i in 1:8
    a[i][!, :id] .= i
end
# create a single data frame with indicator column holding the source
b = reduce(vcat, a)
You can read an arbitrary number of CSV files with a certain pattern in the file name, create a dataframe per file and lastly, if you want, create a single dataframe.
using CSV, Glob, DataFrames
path = raw"C:\..." # directory of your files (raw is useful in Windows to add a \)
files=glob("*.csv", path) # to load all CSVs from a folder (* means arbitrary pattern)
dfs = DataFrame.( CSV.File.( files ) ) # creates a list of dataframes
# add an index column to be able to later discern the different sources
for i in 1:length(dfs)
    dfs[i][!, :sample] .= i # I called the new col sample
end
# finally, reduce your collection of dfs via vertical concatenation
df = reduce(vcat, dfs)

Removing elements meeting criteria in both arrays MATLAB

I have two arrays of unequal length, say A (the longer) and B (the shorter). I wish to remove all elements from both A and B which meet a criterion: if there is a value in A within +/- 0.1 of a value in B, then remove that element from both A and B. Remove only as many values from A as from B (i.e., there can be non-unique elements). If there are multiple elements that could equivalently be removed from A and B, remove the smaller element of B first and the larger element of A first.
Example:
A = [ 1 2 3 3 4 ]
B = [ 3.1, 2.9, 5]
Then 3 and 3 are removed from A, and 3.1 and 2.9 are removed from B.
How do I do this in MATLAB?
You can use ismembertol:
A = [ 1 2 3 3 4 ];
B = [ 3.1, 2.9, 5];
Aind = ismembertol(A,B,0.1);
Bind = ismembertol(B,A,0.1);
A(Aind) = [];
B(Bind) = [];
ismembertol performs a comparison using a tolerance (0.1 in this case).
A similar result can also be achieved with:
lim = 0.1+10^-10 % +10^-10 so we avoid the floating point precision error.
Aind = any(abs(A-B.')<=lim,1)
Bind = any(abs(A-B.')<=lim,2)
A(Aind) = []
B(Bind) = []
Note that this second solution is not memory efficient. It is only suited to small arrays, since it creates a length(A)-by-length(B) matrix.

Python 2.7 convert 2d string array to float array

I read the following string on a .txt file
{{1,2,3,0},{4,5,6,7},{8,-1,9,0}}
using lin = lin.strip() to remove '\n'
Then I replaced { and } to [ and ] using
lin = lin.replace ("{", "[")
lin = lin.replace ("}", "]")
My goal is to convert lin into a float 2d array. So I did
my_matrix = np.array(lin, dtype=float)
but I got an error message: "ValueError: could not convert string to float: [[1,2,3,0],[1,1,1,2],[0,-1,3,9]]"
Removing the dtype, I get a string array. I have already tried multiplying lin by 1.0 and making a copy of lin using .astype(float), but nothing seems to work.
I am using the JSON library to parse the contents of the file and then iterating through the arrays, converting each element into a float. However, an integer solution might already be enough for what you want; that one is much faster and shorter.
import json
fc = '{{1,2,3,0},{4,5,6,7},{8,-1,9,0}}'
a = json.loads(fc.replace('{','[').replace('}',']'))
print(a) # a is now array of integers. this might be enough
for linenumber, linecontent in enumerate(a):
    for elementnumber, element in enumerate(linecontent):
        a[linenumber][elementnumber] = float(element)
print(a) # a is now array of floats
Shorter solution
import json
fc = '{{1,2,3,0},{4,5,6,7},{8,-1,9,0}}'
a = json.loads(fc.replace('{','[').replace('}',']'))
print(a) # a is now array of integers. this might be enough
a = [[float(c) for c in b] for b in a]
print(a) # a is now array of floats
(works for both python 2 and 3)
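Since the question already imports NumPy, it may be worth noting that the nested list returned by json.loads can be handed straight to np.array with dtype=float, which does the per-element conversion in one step:

```python
import json
import numpy as np

fc = '{{1,2,3,0},{4,5,6,7},{8,-1,9,0}}'
arr = np.array(json.loads(fc.replace('{', '[').replace('}', ']')), dtype=float)
print(arr.shape)  # (3, 4)
print(arr.dtype)  # float64
```

This gives the 2-D float array directly, without an explicit conversion loop.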
import numpy as np
readStr = "{{1,2,3,0},{4,5,6,7},{8,-1,9,0}}"
readStr = readStr[2:-2]
# Originally read string is now -> "1,2,3,0},{4,5,6,7},{8,-1,9,0"
line = readStr.split("},{")
# line is now a list object -> ["1,2,3,0", "4,5,6,7", "8,-1,9,0"]
array = []
temp = []
# Now we iterate through 'line', convert each element into a list, and
# then append said list to 'array' on each iteration of 'line'
for string in line:
    num_array = string.split(',')
    for num in num_array:
        temp.append(num)
    array.append(temp)
    temp = []
# Now 'array' holds strings -> [['1','2','3','0'], ['4','5','6','7'], ['8','-1','9','0']]
my_matrix = np.array(array, dtype = float)
# my_matrix = [[1.0, 2.0, 3.0, 0.0]
# [4.0, 5.0, 6.0, 7.0]
# [8.0, -1.0, 9.0, 0.0]]
Although this may not be the most elegant solution, I think it is easy to follow and gives you exactly what you're looking for.

Filtering of data based on condition using matlab

I have ref value as
ref = [9.8 13 10.51 12.2 10.45 11.4]
and In values as
In = [10.7 11 11.5 11.9 12]
I want to do the following two things:
Identify which In value is the closest match to a ref value, and then
check whether the matched In value is lower or higher than the ref value. If it is lower, save it in array1; if it is higher, save it in array2.
See the following code snippet as one of many solutions:
% it would be a much better style
% to initialize the result vectors here properly!
a1 = [];
a2 = [];
for i = 1:length(P_in)
    [value, ind] = min(abs(P_in(i) - P_ref));
    if P_in(i) <= P_ref(ind)
        a1 = [a1 P_in(i)];
    else
        a2 = [a2 P_in(i)];
    end
end
with the given vectors
P_ref = [9.8 13 10.51 12.2 10.45 11.4];
P_in = [10.5 11 11.5 11.9 12];
I get the following result:
a1 = [10.5000 11.0000 11.9000 12.0000]
a2 = [11.5000]
If you have a fixed deviation that is allowed for values to be 'close', the key part of your question can be solved with the ismemberf File Exchange Submission.
Basic syntax:
[tf, loc]=ismemberf(0.3, 0:0.1:1)
It can be extended by defining the allowed tolerance:
[tf, loc]=ismemberf(0.3, 0:0.1:1, 'tol', 1.5)
