Python Logistic Regression (New to Data Science) - logistic-regression

I'm working on a project with two Excel files: master.xls and sample.xls. The master file has both the dependent and the independent variables, whereas sample.xls has only the independent variables, and I need to predict the dependent variable for it (either 1 or 0, where 1 = Diabetic and 0 = Not Diabetic).
Now I need to use the master file to train the model and predict the dependent variable for the sample file, but I'm not sure how to split the data between train and test. Need help.

Use the read_excel function of the pandas library to load the data (say, master.xls):
import pandas as pd
df = pd.read_excel('master.xls')
Let's say y is the dependent variable (i.e., the ground-truth value in machine-learning terminology). Get the y column values and drop the column from the dataframe df:
y = df['y']
df = df.drop(['y'],axis=1)
Now use the train_test_split function of scikit-learn to split the data into train and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)
Now X_train will have 70% of the data points and X_test the remaining 30%. y_train and y_test are the corresponding dependent variables for the train and test data.
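To go from the split to predictions for sample.xls, here is a minimal sketch, assuming the dependent column in master.xls is named y and that sample.xls has exactly the same independent columns:
from sklearn.linear_model import LogisticRegression

# fit a logistic regression model on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# check accuracy on the held-out test split
print(model.score(X_test, y_test))

# predict the dependent variable (1 = Diabetic, 0 = Not Diabetic) for the sample file
sample = pd.read_excel('sample.xls')
sample['y_pred'] = model.predict(sample)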

Related

Looping dataframes to mmer model, saving the output in 1 dataframe

I badly need advice on how to write code so that all the dataframes are used in a sommer model. In addition, the VarComp output should be stored in another data frame in order to compute the h2.
To give you an idea, here's the scenario.
I have 388 dataframes, each with 3 columns: 2 columns are random-effects variables (REV) and the other column is the phenotype.
The code I am using is:
ans1 <- mmer(Phenotype ~ 1, random = ~ C1 + C2 + C1:C2, data = dataframe1)
summary(ans1)$VarComp
The last line gives the variance components of the model.
I need to save the VarComp object in a dataframe, with the phenotype as the column name, in order to compute the h2 at the end of the analysis.
Thank you so much

matrices are not aligned while trying to use np.dot in numpy arrays in jupyter notebook

I am running the code below in a Jupyter notebook and I am not able to compute the dot product of two matrices.
import numpy as np
import pandas as pd

# creating random array
np.random.seed(0)
sales_amounts = np.random.randint(20 , size=(5,3))
sales_amounts
# creating weekly sales dataframe
weekly_sales = pd.DataFrame(sales_amounts, index =["Mon","Tues","Wed","Thur","Fri"],
columns =["Almond Butter","Peanut Butter","Cashew Butter"])
weekly_sales
# Create the price array
prices = np.array([10,8,12])
prices
prices.shape
#Create butter prices dataframe
butter_prices = pd.DataFrame(prices.reshape(1,3), index=["price"],columns= ["Almond Butter","Peanut_Butter","Cashew Butter"])
butter_prices
# shapes not aligned lets transpose
total_sales = prices.dot(sales_amounts.T)
total_sales
#creating daily sales
butter_prices.shape,weekly_sales.shape
daily_sales = butter_prices.dot(weekly_sales.T)
After executing the above code in the Jupyter notebook, it shows the error: matrices are not aligned.
Solution:
np.dot(butter_prices, weekly_sales.T) works.
Explanation:
DataFrame.dot aligns on labels: the columns of butter_prices must match the index of weekly_sales.T (i.e. the columns of weekly_sales). Here butter_prices has the column "Peanut_Butter" while weekly_sales has "Peanut Butter", so the labels do not match and pandas raises "matrices are not aligned". np.dot operates on the underlying arrays and ignores the labels, which is why it succeeds. Read the first answer from this thread for more detail on why this happens.
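A minimal sketch of the label-alignment fix, if you want to keep working with the dataframes rather than the raw arrays:
# option 1: make the labels match, then DataFrame.dot aligns correctly
butter_prices = butter_prices.rename(columns={"Peanut_Butter": "Peanut Butter"})
daily_sales = butter_prices.dot(weekly_sales.T)

# option 2: bypass label alignment entirely and work on the raw values
daily_sales_values = np.dot(butter_prices.values, weekly_sales.T.values)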

sum of variables in SPSS-statistics25 for multiple external files

I have 50 external Excel files. For each of these files, say #I, I import the data as follows in the syntax of SPSS Statistics 25:
GET DATA /TYPE=XLSX
/FILE='file#I.xlsx'
/SHEET=name 'Sheet2'
/CELLRANGE=full
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.
Then I rank the variables included in file #I (WA and CI) and select at most one single case, as follows:
RANK VARIABLES= WA CI (D)
/RANK
/PRINT=YES
/TIES=LOW.
COUNT SVAR= RWA RCI (1).
SELECT IF( SVAR=2).
EXECUTE.
The task is the following:
I should print the sum of the RWA values, looping over each Excel file #I. RWA can have the value 1 or can be empty. If no case is selected (RWA is empty), the contribution to the sum should be 0. The final outcome should be the number of times RWA and RCI share the same top rank across the 50 Excel files.
How can I do this in a smart way?
Since I can't see the real data files, the following is a little in the dark, but I think it should be a viable strategy (you might as well try :)):
* first defining a macro to stack all the files together.
define stackFiles ()
GET DATA /TYPE=XLSX /FILE='file1.xlsx'
/SHEET=name 'Sheet2' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767 /keep WA CI.
compute source=1.
exe.
dataset name gen.
!do !i=2 !to 50
GET DATA /TYPE=XLSX /FILE=!concat("'file", !i, ".xlsx'")
/SHEET=name 'Sheet2' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767/keep WA CI.
compute source=!i.
exe.
add files /file=gen /file=*.
exe.
!doend.
!enddefine.
* now run the macro.
stackFiles .
* now for the rest of the analysis.
* first split the data by source file, then rank and select.
sort cases by source.
split file by source.
RANK VARIABLES= WA CI (D) /RANK /PRINT=YES /TIES=LOW.
COUNT SVAR= RWA RCI (1).
SELECT IF SVAR=2.
EXECUTE.
At this point you have up to 50 rows remaining - 0 or 1 from each original file. You can count or sum them using DESCRIPTIVES on RWA.

How to read the same column from multiple files and collect it in an array

I have 9 CSV files, each containing the same number of columns (61) and the same column headers. The files are basically follow-ups of each other: each column belongs to a signal reading which has been recorded over a long period of time and hence split across multiple files. I need to graph the collected data for every single column. To do that, I thought I would read one column at a time from all files, store the data in an array, and graph it against time.
Since the data load is too much (the system takes a reading every 5 seconds for a month), I want to read the data only every 30 minutes, which equals reading 1 row out of every 360 rows.
I've tried plotting everything without skipping rows and it takes forever because of the data load.
import glob
import pandas as pd

file_list = glob.glob('*.csv')
cols = [0, 1]  # add more columns here
df = pd.DataFrame()
for f in file_list:
    df = df.append(
        pd.read_csv(f, delimiter='\s+', header=None, usecols=cols),
        ignore_index=True,
    )
arr = df.values
This is what I tried in order to read only specific columns from multiple files, but I receive this message: "Usecols do not match columns, columns expected but not found: [1]".
The commands below will do a parallel read followed by a concatenation, assuming file_list contains a list of files that can be read with the read_file function below:
import multiprocessing as mp
import pandas as pd

def read_file(file):
    return pd.read_csv(file)

pool = mp.Pool(mp.cpu_count())  # one worker per CPU. You can try other things
df = pd.concat(pool.map(read_file, file_list))
pool.terminate()
pool.join()
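As for the "Usecols do not match columns" error itself: it usually means the parser found only one column. With delimiter='\s+' and header=None, a comma-separated file is read as a single field per line, so column 1 does not exist. Assuming the files really are comma-separated with a header row, a read_file variant that selects columns by name and keeps only every 360th row (one reading per 30 minutes at a 5-second sampling rate) can be dropped into the pool.map pattern above; the column names used here are hypothetical placeholders:
import glob
import multiprocessing as mp
import pandas as pd

WANTED_COLS = ['timestamp', 'signal_1']  # hypothetical names - use your real headers

def read_file(file):
    # select columns by name and keep the header plus every 360th data row
    return pd.read_csv(
        file,
        usecols=WANTED_COLS,
        skiprows=lambda i: i != 0 and i % 360 != 0,
    )

if __name__ == '__main__':
    file_list = sorted(glob.glob('*.csv'))
    pool = mp.Pool(mp.cpu_count())
    df = pd.concat(pool.map(read_file, file_list), ignore_index=True)
    pool.terminate()
    pool.join()
    arr = df.values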

MongoDB and Arctic

I intend to analyse multiple data sets on the same time series (daily EOD). I will need to use computed columns: use columns A + B to create column C (storing the net result of the calculation in column C). Is this functionality available using the MongoDB / Arctic database?
I would also intend to search the data. For example: what happens when the advance-decline thrust pushes over 70 while the cumulative TICK was below -100,000 in the past 'n' days?
Two data sets: the cumulative TICK and the advance-decline thrust (which uses advancers/decliners data). They would be stored in the database, and then I would want the capability to search for the above condition. Is this achievable with the MongoDB / Arctic database structure?
Just looking for some general information before I move to a DB format. Currently everything I have created is in Excel / VBA, and it has already outgrown that!
Any information greatly appreciated.
Note: I will use the same database for weekly, monthly, and yearly as well as 1-minute, 3-minute, 5-minute, and 60-minute TICK/TIME based bars - not feeding live, but updated EOD.
Yes, this can be done with Arctic. Arctic can store pandas dataframes, and an operation like the one you mention is trivial in pandas. Arctic is just a store, so you'd read the data out of Arctic (data is stored in symbols), perform your transform, and then write the data back. Any of the storage engines (VersionStore, TickStore, or ChunkStore) should work for this.
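A minimal sketch of that read-transform-write cycle with the arctic library; the library name, symbol name, and column names are made up for illustration, and this assumes a MongoDB instance running on localhost:
import pandas as pd
from arctic import Arctic

store = Arctic('localhost')            # connect to MongoDB
store.initialize_library('EOD_DATA')   # one-time library creation
library = store['EOD_DATA']

# read the stored dataframe for a symbol
df = library.read('MARKET_BREADTH').data

# computed column: C = A + B
df['C'] = df['A'] + df['B']

# searching is ordinary pandas filtering, e.g. the thrust/TICK condition
hits = df[(df['adv_decl_thrust'] > 70) & (df['cum_tick'] < -100000)]

# write the result back as a new version of the symbol
library.write('MARKET_BREADTH', df)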
