generate dataframe of model predictions by looping through dictionary of models

I would like to loop through a dictionary like this:
models = {'OLS': LinearRegression(),
          'Lasso': Lasso(),
          'LassoCV': LassoCV(n_alphas=300, cv=3)}
and then I want to generate a DataFrame of each model's predictions.
So far I wrote this code, which only generates a list of arrays, one per model:
predictions = []
for i in models:
    predictions.append(models[i].fit(X_train, y_train).predict(X_test))
As the final result, I want a DataFrame with each column labelled by the key in the dictionary and that model's predictions inside the column.
Thank you!

Instead of appending the predictions to a list, you can insert them directly into a data frame.
Code:
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, LassoCV

models = {'OLS': LinearRegression(),
          'Lasso': Lasso(),
          'LassoCV': LassoCV(n_alphas=300, cv=3)}
df = pd.DataFrame()
for i in models:
    df[i] = models[i].fit(X_train, y_train).predict(X_test)
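Equivalently, a dict comprehension builds the same frame in one step (a sketch, assuming X_train, y_train, and X_test are already defined):
df = pd.DataFrame({name: model.fit(X_train, y_train).predict(X_test)
                   for name, model in models.items()})
Each dictionary key becomes a column label, so the result has the same shape as the loop version.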

Related

How can I extract an image and a caption dataset from a h5 file?

I want to use the FashionGen dataset, which has two h5-format files for train and validation data. The h5 file's list of datasets looks like this:
index
index_2
input_brand
input_category
input_composition
input_concat_description
input_department
input_description
input_gender
input_image
input_msrpUSD
input_name
input_pose
input_productID
input_season
input_subcategory
And I just need the "Input_image" and "Input_description" datasets. Would you mind helping me, please?
Details depend on the dataset dtype and shape, and on the Python objects to be created. This code will get you started; review the h5py docs for details: h5py Quick Start Guide. Note: dataset and group names are case sensitive. Be sure to verify whether they are "Input_image" or "input_image".
import h5py

with h5py.File(filename, 'r') as h5f:
    # create NumPy array from image dataset:
    image_arr = h5f['input_image'][:]
    # create NumPy array from description dataset:
    descr_arr = h5f['input_description'][:]
Note: if the datasets are too large to fit into memory, you can use h5py dataset objects and reference them as if they were NumPy arrays. The code is very similar. See below:
with h5py.File(filename, 'r') as h5f:
    # create h5py object of images dataset:
    image_ds = h5f['input_image']
    # create h5py object of description dataset:
    descr_ds = h5f['input_description']
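As a usage sketch of those dataset objects (the slice bounds are just illustrative), you can read only the rows you need while the file is open:
with h5py.File(filename, 'r') as h5f:
    image_ds = h5f['input_image']
    first_images = image_ds[:100]            # reads only the first 100 images from disk
    one_descr = h5f['input_description'][0]  # reads a single element
Note that dataset objects are only valid while the file is open; anything you need afterwards should be copied out as a NumPy array.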

Should I use collection or dictionary for vba?

I have multiple datasets and sheets of this data in Excel.
They are train timetables:
depart    a        a        a        a
arrive    t        t        t        t
Train     z1       z2       z3       z4
station
a
as        6:30:00  7:47:00  8:18:00  9:45:00
b
bs        6:33:00  7:50:00  8:21:00  9:48:00
c         6:35:00  7:52:00  8:23:00  9:50:00
cs        6:35:30  7:52:30  8:23:30  9:50:30
I am trying to put the data into a Collection or a Dictionary so that I can pull out data (mainly times) by train and station, or by time and station, etc.
For a Dictionary, it seems like I would need nested dictionaries.
For a Collection, can I loop through all items by some criteria?
Can anyone give me a hint about which method to use for getting at the data (time, station, or train)?
Any advice would be appreciated.
Thank you
Whether to put them in a Collection or a Dictionary depends on how you intend to retrieve them afterwards.
Start by describing the data of a single record using public fields in a class module:
Option Explicit
Public TrainID As String
Public DepartureStationID As String
Public DepartureUTC As Date
Public ArrivalStationID As String
Public ArrivalUTC As Date
Now you can write a function that parses a worksheet Range into a single instance of that class, then a procedure that runs this function for each Range of interest (I'm not sure I'm reading the data correctly, but that would be the general idea).
If you plan to iterate them all and run some method in a loop, use a Collection (and a For Each loop).
If you plan to retrieve them by ID, then you could use the TrainID or DepartureStationID field as a dictionary key, and have each "item" be a Collection of instances of that class (otherwise you'll run into duplicate-key issues).
If you plan to parse a bunch of data sources and aggregate them into a queryable dataset, you only need a Collection to store the objects you're parsing; you'll be iterating that collection when you dump these objects onto a worksheet/table for pivoting and PowerQuery-ing =)
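To make the dictionary-of-collections shape from the second option concrete, here is the same idea sketched in Python rather than VBA (field names follow the class module above; the sample values are made up):
from collections import defaultdict

# key: TrainID -> list of stop records for that train
by_train = defaultdict(list)
by_train['z1'].append({'DepartureStationID': 'as', 'DepartureUTC': '6:30:00'})
by_train['z1'].append({'DepartureStationID': 'bs', 'DepartureUTC': '6:33:00'})

# retrieve all stops for one train by its ID:
for stop in by_train['z1']:
    print(stop['DepartureStationID'], stop['DepartureUTC'])
In VBA the outer structure would be the Dictionary and each inner list a Collection, but the lookup pattern is the same.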

Python List of users

I want to learn how to create an API, so I want to create an empty dictionary where the first key is Names. Names will hold the names of the users the system will have.
How do I actually do this with Python?
People = [{}]
I want it to be something like:
People = [Names:["name1", "name2"... "nameN"]]
Later on, I want to add more information, for example:
People[Names:[], Age:["1","2"]..]
I want at some stage to be able to relate any name to any other key correctly:
name1 has age 1, and so on for the next key...
How do I declare this dictionary?
Perhaps using a pandas DataFrame is useful in this case. As the following example illustrates, it allows you to easily add both people and variables.
import pandas as pd

df = pd.DataFrame({'Names': ['name1', 'name2'], 'Age': [1, 2]})
# adding a column: Gender
df['Gender'] = ['male', 'female']
# adding a row for name3, a three-year-old male
third_person = {'Names': 'name3', 'Age': 3, 'Gender': 'male'}
df = pd.concat([df, pd.DataFrame([third_person])], ignore_index=True)
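Once the data is in a DataFrame, relating one key to another is a single lookup (a sketch against the df built above):
# the age recorded for name1:
age = df.loc[df['Names'] == 'name1', 'Age'].iloc[0]  # -> 1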
What I was trying to do:
person = {
    "name": {
        "title": "Herr", "first": "foo", "last": "bar"
    },
    "email": "bar@foo.de",
    "password": "123",
    "property": "value"
}

SparkR - extracting dataframe's array<int> for an R function

I have 1000s of sensors. I need to partition the data (i.e. per sensor, per day), then submit each list of data points to an R algorithm. Using Spark, a simplified sample looks like:
//Spark
val rddData = List(
  ("1:3", List(1,1,456,1,1,2,480,0,1,3,425,0)),
  ("1:4", List(1,4,437,1,1,5,490,0)),
  ("1:6", List(1,6,500,0,1,7,515,1,1,8,517,0,1,9,522,0,1,10,525,0)),
  ("1:11", List(1,11,610,1))
)

case class DataPoint(
  key: String,
  value: List[Int]) // 4-value pattern: sensorID:seq#, seq#, value, state
I convert this to a Parquet file and save it.
Load the parquet in SparkR, no problem, the schema says:
#SparkR
df <- read.df(sqlContext, filespec, "parquet")
schema(df)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value", type = "ArrayType(IntegerType,true)", nullable = TRUE
So in SparkR, I have a dataframe where each record has all of the data I want (df$value). I want to extract that array into something R can consume then mutate my original dataframe(df) with a new column holding the resultant array. Logically something like results = function(df$value). Then I need to get results (for all rows) back into a SparkR dataframe for output.
How do I extract an array from the SparkR dataframe and then mutate it with the results?
Let the Spark DataFrame be df and the R data frame be df_r.
To convert the SparkR df to an R df, use:
df_r <- collect(df)
With the R data frame df_r, you can do all the computations you want to do in R.
Let's say you have the result in the column df_r$result.
Then, to convert back to a SparkR data frame, use:
#this is a new SparkR data frame, df_1
df_1 <- createDataFrame(sqlContext, df_r)
To add the result back to the SparkR data frame `df`, use:
#this adds df_1$result to a new column df$result
#note that the number of rows should be the same in df and `df_1`; if not, use a `join` operation
df$result <- df_1$result
Hope this solves your problem.
I had this problem too. The way I got around it was by adding a row index into the spark DataFrame and then using explode inside a select statement. Make sure to select the index and then the row you want in your select statement. That will get you a "long" dataframe. If each of the nested lists in the DataFrame column has the same amount of information in it (for example if you are exploding a list-column of x,y coordinates), you would expect each row index in the long DataFrame to occur twice.
After doing the above, I typically do a groupBy(index) on the exploded DataFrame, filter where the n() of each index is not equal to the expected number of items in the list and proceed with additional groupBy, merge, join, filter, etc. operations on the Spark DataFrame.
There are some excellent guides on the Urban Institute's GitHub page. Good luck. -nate
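For readers more comfortable in Python, here is a sketch of that index-plus-explode idea in PySpark rather than SparkR (the column names mirror the question's schema; posexplode emits one row per array element together with its position):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("1:3", [1, 1, 456, 1]), ("1:4", [1, 4, 437, 1])],
    ["key", "value"])

# add a row index, then explode the array into (position, element) rows
long_df = (df
           .withColumn("idx", F.monotonically_increasing_id())
           .select("idx", "key", F.posexplode("value").alias("pos", "element")))
long_df.show()
The resulting "long" DataFrame can then be grouped and filtered as the answer describes.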

Best database design (model) for user tables

I'm developing a web application using Google App Engine and Django, but I think my problem is more general.
Users have the possibility to create tables; note that these tables are not represented as TABLES in the database. Let me give you an example:
First form:
Name of the table: __________
First column name: __________
Second column name: _________
...
The number of columns is not fixed, but there is a maximum (100, for example). The type in every column is the same.
Second form (after choosing a particular table the user can fill the table):
column_name1: _____________
column_name2: _____________
....
I'm using this solution, but it's wrong:
class Table(db.Model):
    name = db.StringProperty(required=True)

class Column(db.Model):
    name = db.StringProperty(required=True)
    number = db.IntegerProperty()
    table = db.ReferenceProperty(Table, collection_name="columns")

class Value(db.Model):
    time = db.TimeProperty()
    column = db.ReferenceProperty(Column, collection_name="values")
When I want to list a table, I take its columns, and from every column I take its values:
data = []
for column in table.columns:
    column_data = []
    for value in column.values:
        column_data.append(value.time)
    data.append(column_data)
data = zip(*data)
I think the problem is the order of the values, because it is not guaranteed that the order for one column is the same as for the others. I'm waiting for this bug (but so far I have never seen it):

Table as I want:    As I will get it:
a z c               a e c
d e f               d h f
g h i               g z i
Better solutions? Maybe using ListProperty?
Here's a data model that might do the trick for you:
class Table(db.Model):
    name = db.StringProperty(required=True)
    owner = db.UserProperty()
    column_names = db.StringListProperty()

class Row(db.Model):
    values = db.ListProperty(yourtype)
    table = db.ReferenceProperty(Table, collection_name='rows')
My reasoning:
You don't really need a separate entity to store column names. Since all columns are of the same data type, you only need to store the name, and the fact that they are stored in a list gives you an implicit order number.
By storing the values in a list in the Row entity, you can use an index into the column_names property to find the matching value in the values property.
By storing all of the values for a row together in a single entity, there is no possibility of values appearing out of their correct order.
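As a quick sketch of the lookup that reasoning describes (the column name 'age' and the table/row entities are assumed for illustration):
# the position of a column in the ordered name list...
col_index = table.column_names.index('age')
# ...is also the position of its value in every row
cell_value = row.values[col_index]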
Caveat emptor:
This model will not work well if the table can have columns added to it after it has been populated with data. To make that possible, every time a column is added, every existing row belonging to that table would have to have a value appended to its values list. If it were possible to efficiently store dictionaries in the datastore, this would not be a problem, but lists can really only be appended to.
Alternatively, you could use Expando...
Another possibility is that you could define the Row model as an Expando, which allows you to dynamically create properties on an entity. You could set column values only for the columns that have values in them, and you could also add columns to the table after it has data in it without breaking anything:
class Row(db.Expando):
    table = db.ReferenceProperty(Table, collection_name='rows')

    @staticmethod
    def __name_for_column_index(index):
        return "column_%d" % index

    def __getitem__(self, key):
        # Allows one to get at the columns of Row entities with
        # subscript syntax:
        #   first_row = Row.get()
        #   col1 = first_row[1]
        #   col12 = first_row[12]
        value = None
        try:
            value = self.__dict__[Row.__name_for_column_index(key)]
        except KeyError:
            # The given column is not defined for this Row
            pass
        return value

    def __setitem__(self, key, value):
        # Allows one to set the columns of Row entities with
        # subscript syntax:
        #   first_row = Row.get()
        #   first_row[5] = "New values for column 5"
        self.__dict__[Row.__name_for_column_index(key)] = value
        # In order to allow efficient multiple column changes,
        # the put() can go somewhere else.
        self.put()
Why don't you add an IntegerProperty to Value for rowNumber and increment it every time you add a new row of values? Then you can reconstruct the table by sorting on rowNumber.
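A sketch of that reconstruction in plain Python (rowNumber is the hypothetical property this suggestion adds; the values and times are made up):
# values as they might come back from the datastore, in arbitrary order
class Value:
    def __init__(self, time, rowNumber):
        self.time = time
        self.rowNumber = rowNumber

column_values = [Value("8:18:00", 2), Value("6:30:00", 0), Value("7:47:00", 1)]
# sorting on rowNumber restores the original row order
ordered = [v.time for v in sorted(column_values, key=lambda v: v.rowNumber)]
# ordered == ["6:30:00", "7:47:00", "8:18:00"]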
You're going to make life very hard for yourself unless your users' 'tables' are actually stored as real tables in a relational database. Find some way of actually creating tables and use the power of an RDBMS, or you're reinventing a very complex and sophisticated wheel.
This is the conceptual idea I would use:
I would create two classes for the data-store:
table: this would serve as a dictionary, storing the structure of the pseudo-tables your app would create. It would have three fields: table_name, column_name, column_order, where column_order gives the position of the column within the table.
data: this would store the actual data in the pseudo-tables. It would have four fields: row_id, table_name, column_name, column_data. row_id would be the same for data pertaining to the same row and would be unique for data across the various pseudo-tables.
Put the data in a LongBlob.
The power of a database is being able to search and organise data so that you get only the part you want, for performance and simplicity: you don't want the whole database, you just want a part of it, and you want it fast. But from what I understand, when you retrieve a user's data, you retrieve it all and display it. So you don't need to store the data in the normal "database" way.
What I would suggest is to simply format and store the whole data from a single user in a single column with a suitable type (LongBlob, for example). The format would be an object with a list of columns and rows of the chosen type. And you define the object in whatever language you use to communicate with the database.
The columns in your (real) database would be : User int, TableNo int, Table Longblob.
If user 8 has 3 tables, you will have the following rows:
8, 1, objectcontainingtable1;
8, 2, objectcontainingtable2;
8, 3, objectcontainingtable3;
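A sketch of what one such stored object could look like, serialized as JSON before being written into the LongBlob column (the column names and cell contents are made up):
import json

# one user's whole pseudo-table as a single object
table_obj = {
    "columns": ["col1", "col2"],
    "rows": [["a", "z"], ["d", "e"]],
}
blob = json.dumps(table_obj).encode("utf-8")  # store this in the Table column
restored = json.loads(blob)                   # decode it again on retrieval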
