Matrices are not aligned when trying to use np.dot on NumPy arrays in a Jupyter notebook

I am running the code below in a Jupyter notebook, and I am unable to compute the dot product of the two matrices:
import numpy as np
import pandas as pd

# creating a random sales array
np.random.seed(0)
sales_amounts = np.random.randint(20, size=(5, 3))
sales_amounts
# creating weekly sales dataframe
weekly_sales = pd.DataFrame(sales_amounts, index =["Mon","Tues","Wed","Thur","Fri"],
columns =["Almond Butter","Peanut Butter","Cashew Butter"])
weekly_sales
# Create the price array
prices = np.array([10,8,12])
prices
prices.shape
#Create butter prices dataframe
butter_prices = pd.DataFrame(prices.reshape(1,3), index=["price"],columns= ["Almond Butter","Peanut_Butter","Cashew Butter"])
butter_prices
# shapes are not aligned, so let's transpose
total_sales = prices.dot(sales_amounts.T)
total_sales
# creating daily sales
butter_prices.shape,weekly_sales.shape
daily_sales = butter_prices.dot(weekly_sales.T)
Executing the last line in the Jupyter notebook raises the error:
ValueError: matrices are not aligned

Solution:
np.dot(butter_prices, weekly_sales.T) works.
Explanation:
Unlike np.dot, pandas' DataFrame.dot aligns on labels: the columns of the left operand must match the index of the right operand. Here butter_prices has the column label "Peanut_Butter" (with an underscore) while weekly_sales.T has the index label "Peanut Butter" (with a space), so pandas cannot align them and raises "matrices are not aligned". np.dot ignores the labels and multiplies the underlying (1, 3) and (3, 5) arrays directly, which is why it succeeds.
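A minimal sketch of the two possible fixes, using the DataFrames from the question above: either bypass label alignment with np.dot, or repair the mismatched label so that DataFrame.dot can align.
# Option 1: bypass label alignment by multiplying the underlying arrays
daily_sales = np.dot(butter_prices, weekly_sales.T)

# Option 2 (assumed fix): make the labels match, then DataFrame.dot aligns correctly
butter_prices = butter_prices.rename(columns={"Peanut_Butter": "Peanut Butter"})
daily_sales = butter_prices.dot(weekly_sales.T)
daily_sales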

Related

Looping dataframes through an mmer model and saving the output in one dataframe

I badly need advice on how to write code that runs all of the dataframes through a sommer model. In addition, the VarComp output should be stored in another dataframe in order to compute the h2.
To give you an idea, here's the scenario.
I have 388 dataframes, each with 3 columns: 2 columns are random-effect variables (REV) and the other column is the phenotype.
The code I am using is:
ans1 <- mmer(Phenotype ~ 1, random = ~ C1 + C2 + C1:C2, data = dataframe1)
summary(ans1)$VarComp
The last call gives you the variance components of the model.
I need to save the VarComp object in a dataframe, with the phenotype as the column name, in order to compute the h2 at the end of the analysis.
Thank you so much.

Plotting a time series from a structure array in MATLAB

I have downloaded stock price data for each company in the S&P 500 (over the last 5 years) and stored it in a structure array called stocks_adj_close (see code).
I would like to perform some data analysis, such as plotting; for example, I would like to plot five stocks on the same graph.
load('SP500stocks.mat'); % load data from a previously saved workspace
fields = {'Open', 'High', 'Low', 'Close', 'Volume'}; % fields to drop
stocks_adj_close = rmfield(stocks, fields); % remove the fields listed above
struct2csv(stocks_adj_close, 'SP500 Adjusted Stocks Prices (2000-2019).csv');

How to read the same column from multiple files and collect it in an array

I have 9 CSV files, each containing the same number of columns (61) as well as the same column headers. The files are essentially continuations of each other. Each column holds a signal reading that was recorded over a long period of time and hence divided into multiple files. I need to graph the collected data for every single column. To do that, I thought I would read one column at a time from all the files, store the data in an array, and graph it against time.
Since the data load is too large (the system takes a reading every 5 seconds for a month), I want to keep one reading every 30 minutes, which works out to reading 1 row out of every 360 rows.
I've tried plotting everything without skipping rows, and it takes forever because of the data load.
import glob
import pandas as pd

file_list = glob.glob('*.csv')
cols = [0, 1]  # add more columns here
df = pd.DataFrame()
for f in file_list:
    df = df.append(
        pd.read_csv(f, delimiter='\s+', header=None, usecols=cols),
        ignore_index=True,
    )
arr = df.values
This is what I tried in order to read only specific columns from multiple files, but I receive this message: "Usecols do not match columns, columns expected but not found: [1]".
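The error most likely comes from delimiter='\s+': with comma-separated files, pandas parses each whole line as a single column, so column 1 does not exist. A minimal sketch of one fix, assuming the files really are comma-separated with a header row and that keeping 1 row out of every 360 is acceptable (file_list and cols as in the snippet above):
import glob
import pandas as pd

file_list = sorted(glob.glob('*.csv'))
cols = [0, 1]  # add more columns here

frames = [
    pd.read_csv(
        f,
        usecols=cols,  # default comma delimiter instead of '\s+'
        header=0,      # the files have column headers
        skiprows=lambda i: i != 0 and i % 360 != 0,  # keep the header plus 1 row per 360
    )
    for f in file_list
]
df = pd.concat(frames, ignore_index=True)
arr = df.values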
The command below will do a parallel read followed by a concatenation, assuming file_list contains a list of files that can be read with the read_file function below:
import multiprocessing as mp

import pandas as pd

def read_file(file):
    return pd.read_csv(file)

pool = mp.Pool(mp.cpu_count())  # one worker per CPU; you can try other values
df = pd.concat(pool.map(read_file, file_list))
pool.close()
pool.join()
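Note that on platforms that spawn worker processes (Windows, and typically notebooks), the pool creation and map generally need to sit under an if __name__ == '__main__': guard in a script so that read_file can be imported by the workers.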

Python Logistic Regression (New to Data Science)

I'm working on a project where there are two Excel files: master.xls and sample.xls. The master file has both the dependent and the independent variables, whereas sample.xls has only the independent variables, and I need to predict the dependent variable (either 1 or 0, where 1 = Diabetic and 0 = Not Diabetic).
Now I need to use the master file to train the model and predict the dependent variable for the sample file, but I am not sure how to split the data between train and test. Need help.
Use the read_excel function of the pandas library to load the data (say master.xls):
import pandas as pd
df = pd.read_excel('master.xls')
Let's say y is the dependent variable (i.e., the ground-truth value in machine learning terminology). Get the y column values and delete the column from the dataframe df:
y = df['y']
df = df.drop(['y'],axis=1)
Now use the train_test_split function of scikit-learn to split the data into train and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)
Now X_train will have 70% of the total data points and X_test will have 30%. y_train and y_test are the corresponding dependent variables for the train and test data.
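From here, a minimal sketch of the remaining steps, assuming the dependent column is named 'y' and the sample file is sample.xls, as above:
from sklearn.linear_model import LogisticRegression

# fit a logistic regression model on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# check generalization on the held-out test split
print('test accuracy:', model.score(X_test, y_test))

# predict the dependent variable (1 = Diabetic, 0 = Not Diabetic) for the sample file
sample = pd.read_excel('sample.xls')
sample_pred = model.predict(sample)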

Efficient Excel formula for returning multiple matches from a large number of rows

I'm stumped by a major issue. I have a data set consisting of about 16,000 rows (could be more in future). This list is basically a price list containing products and their corresponding installation fees. The products are classified by the following hierarchy: City -> Category -> Rating/Type. Previously, I referred to each set with a named range built by concatenating City & Category & Rating (_XYZ_SPC_9.5). This resulted in about 1,500 named ranges, which inflated the size of the Excel file. So I decided to calculate the products on the fly using inputs from the user. I have tried array formulas and simple formulas, but they take some time to calculate (16,000 rows!), which is not acceptable from a usability perspective; our sales people are very particular about how much time they have to spend on the tool.
I have uploaded a sample file at:
Price List Sample
Formulas that I have used so far are:
=IFERROR(INDEX($H$6:$H$15000, SMALL(INDEX(($AE$9=$R$6:$R$15000)*(MATCH(ROW($R$6:$R$15000), ROW($R$6:$R$15000)))+($AE$9<>$R$6:$R$15000)*15000, 0, 0), AC3)),"Not Available")
{=IFERROR(INDEX(ref_PRICE_LIST!$H$6:$H$16074,MATCH(INDEX(ref_PRICE_LIST!$H$6:$H$16074,(SMALL(IF(IF(RIGHT($AE$3,3)="All",ref_PRICE_LIST!$Z$6:$Z$16074,ref_PRICE_LIST!$R$6:$R$16074)=$AE$3,ROW(ref_PRICE_LIST!$H$6:$H$16074)-ROW(ref_PRICE_LIST!$H$6)+1),$AC3))),ref_PRICE_LIST!$H$6:$H$16074,0),1),"Not Available")}
I would really appreciate if someone can help me out.
Thank you so much!
I think the best way to speed this up is to split the formula into a helper column (K) and a result column (L).
Helper column (copy down for all 16,000 data rows):
=IF($D:$D=$O$2,ROW(),"")
Result column (starting at L2, copy down as many rows as you need):
=IFERROR(INDEX($F:$F,SMALL($K:$K,ROW()-1)),"Not available")
I've tested this with about 150,000 rows, and it updates in under 1 second.
