Logistic Regression Coefficient

I need some help with logistic regression.
Below is my data:
ID         | Mach_1 | Mach_2 | Mach_3 | Mach_4 | Mach_5 | ... | Mach_300 | Rejected Unit (%) | Yield(%)
127189.11  | 1      | 0      | 1      | 1      | 1      | ... | 0        | 0.23              | 98.0%
178390.11  | 0      | 0      | 0      | 1      | 0      | ... | 0        | 0.10              | 90.0%
902817.11  | 1      | 0      | 1      | 0      | 1      | ... | 0        | 0.60              | 94.0%
DSK1201.11 | 1      | 0      | 0      | 0      | 1      | ... | 0        | 0.02              | 99.98%
I have about 300 Mach columns and 2K rows. I want to determine, for each machine, how much it contributes to the rejected-unit percentage, i.e. which machines are responsible for the rejected units.
I have done some of the coding, but I am facing errors that I don't understand and don't know how to solve.
Below is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('Data.csv')

# Encode ID as numerical
le = LabelEncoder()
df['ID'] = le.fit_transform(df.iloc[:, 0])

# Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Get the coefficient for each feature column
import statsmodels.api as sm
model = sm.Logit(y_train, X_train)
res = model.fit()
print(res.summary())
This was my code at first, and I got this error:
ValueError: endog must be in the unit interval
Then I scaled my y (target variable), and I got another error that I don't understand either.
This is my latest code, after scaling the data:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('Data.csv')

# Encode ID as numerical
le = LabelEncoder()
df['ID'] = le.fit_transform(df.iloc[:, 0])

# Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']

# Scale target variable into [0, 1]
from sklearn.preprocessing import MinMaxScaler
y_reshape = y.values.reshape(-1, 1)
scaler = MinMaxScaler()
y_scale = scaler.fit_transform(y_reshape)

# Change the numpy array y_scale into a dataframe
y = pd.DataFrame(y_scale)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Get the coefficient for each feature column
import statsmodels.api as sm
model = sm.Logit(y_train, X_train)
res = model.fit()
print(res.summary())
Then I get this error:
Can anyone help me with this?
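For reference, here is a minimal sketch of one way to get a coefficient per machine without either error, assuming the CSV layout shown above. It rescales the target into [0, 1] (statsmodels' Logit accepts fractional responses in the unit interval) and drops the non-feature columns; the exact column names and the choice to drop Yield(%) are my assumptions, so adjust them to your data:

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('Data.csv')

# Keep only the machine indicator columns as features
# (dropping ID and Yield(%) here is an assumption; adjust to your data)
X = df.drop(columns=['ID', 'Rejected Unit (%)', 'Yield(%)'])

# Squash the target into [0, 1] so Logit's unit-interval check passes
y = MinMaxScaler().fit_transform(df[['Rejected Unit (%)']]).ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One fitted coefficient per machine column
model = sm.Logit(y_train, sm.add_constant(X_train))
res = model.fit()
print(res.summary())

The machines with the largest positive coefficients are then the ones most associated with a high rejection rate.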

Related

Summation based on unique entries of two arrays | Speed Issue

I have 3 arrays of size 803500*1 with the following details:
Rid: It can contain any number
RidID: It contains elements from 1 to 184 in random order. Each element appears multiple times.
r: It contains elements 0,1,2,...12. All elements (except zero) appear nearly 3400 to 3700 times at random indices in this array.
The following may be useful for generating sample data:
Rid = rand(803500,1);
RidID = randi(184,803500,1);
r = randi(13,803500,1)-1; %This may not be a good sample for r as per previously mentioned details?
What I want to do:
I want to calculate the sum of those entries of Rid which correspond to each positive unique entry of r and each unique entry of RidID.
This may be clearer with the code which I wrote for this problem:
RNum = numel(unique(RidID));
RSum = ones(RNum,12); % Preallocating for better speed
for i = 1:12
    RperM = (r == i);
    for j = 1:RNum
        RSum(j,i) = sum(Rid(RperM & (RidID == j)));
    end
end
Issue:
My code works, but it takes 5 seconds on average on my computer, and I have to do this calculation nearly a thousand times. If this time could be reduced from 5 seconds to at least half of that, I'd be very happy. But how do I optimize this? I don't mind whether it's improved with vectorization or a better-written loop.
I am using MATLAB R2017b.
You can use accumarray:
u = unique(RidID);
A = accumarray([RidID r+1], Rid);
RSum = A(u, 2:13);
This is slower than accumarray as suggested by rahnema, but using findgroups and splitapply may save memory.
In your example, there may be thousands of zero-valued elements in the resulting matrix, where a combination of RidID and r does not occur. In this case a stacked result would be more memory efficient, like so:
RidID | r | Rid_sum
-------------------------
1 | 1 | 100
2 | 1 | 200
4 | 2 | 85
...
This can be achieved with the following code:
[ID, rn, RidIDn] = findgroups(r, RidID); % Get unique combo ID for 'r' and 'RidID'
RSum = splitapply( @sum, Rid, ID );      % Sum for each ID
output = table( RidIDn, rn, RSum );      % Nicely formatted table output
% Get rid of elements where r == 0
output( output.rn == 0, : ) = [];
You could convert this to the same output as the accumarray method, but it's already a slower method...
% Convert to 'unstacked' 2D matrix (optional)
RSum = full( sparse( output.RidIDn, output.rn, output.RSum ) );

PySpark replace Null with Array

After a join by ID, my data frame looks as follows:
ID | Features | Vector
1 | (50,[...] | Array[1.1,2.3,...]
2 | (50,[...] | Null
I ended up with Null values for some IDs in the column 'Vector'. I would like to replace these Null values with an array of zeros with 300 dimensions (the same format as the non-null vector entries). df.fillna does not work here, since it's an array I would like to insert. Any idea how to accomplish this in PySpark?
---edit---
Similar to this post, my current approach:
df_joined = id_feat_vec.join(new_vec_df, "id", how="left_outer")
fill_with_vector = udf(lambda x: x if x is not None else np.zeros(300),
                       ArrayType(DoubleType()))
df_new = df_joined.withColumn("vector", fill_with_vector("vector"))
Unfortunately with little success:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 848.0 failed 4 times, most recent failure: Lost task 0.3 in stage 848.0 (TID 692199, 10.179.224.107, executor 16): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-193-e55fed27fcd8> in <module>()
5 a = df_joined.withColumn("vector", fill_with_vector("vector"))
6
----> 7 a.show()
/databricks/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
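A likely culprit (my reading of the traceback, not something stated in the post): the UDF returns a NumPy array, and Spark's pickler cannot map numpy types onto ArrayType(DoubleType()), which is exactly the numpy.core.multiarray._reconstruct PickleException above. Returning a plain Python list of floats should avoid it; a minimal sketch, reusing df_joined from above:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Return a plain Python list of floats instead of np.zeros(300),
# so Spark can serialize the result as array<double>
fill_with_vector = udf(
    lambda x: x if x is not None else [0.0] * 300,
    ArrayType(DoubleType())
)
df_new = df_joined.withColumn("vector", fill_with_vector("vector"))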
Updated: I couldn't get the SQL expression form to create an array of doubles. 'array(0.0, ...)' appears to create an array of Decimal types. But using the Python functions, you can get it to create an array of doubles properly.
The general idea is to use the when/otherwise functions to selectively update only the rows you want. You can define the literal value you want ahead of time as a column and then drop that into the "THEN" clause.
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType([StructField("f1", LongType()), StructField("f2", ArrayType(DoubleType(), False))])
data = [(1, [10.0, 11.0]), (2, None), (3, None)]
df = sqlContext.createDataFrame(sc.parallelize(data), schema)
# Create a column object storing the value you want in the NULL case
num_elements = 300
null_value = array([lit(0.0)] * num_elements)
# If you want a different type you can change it like this
# null_value = null_value.cast('array<float>')
# Keep the value when there is one, replace it when it's null
df2 = df.withColumn('f2', when(df['f2'].isNull(), null_value).otherwise(df['f2']))
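As an aside (my addition, not part of the original answer): because the replacement only applies where the column is null, pyspark.sql.functions.coalesce expresses the same when/otherwise logic more compactly:

from pyspark.sql.functions import coalesce

# coalesce keeps the first non-null value: f2 where present,
# otherwise the zero-filled array defined above
df2 = df.withColumn('f2', coalesce(df['f2'], null_value))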
You could try to run an update on your dataset with a WHERE clause, replacing every NULL in the Vector column with an array.
Are you using Spark SQL and DataFrames?

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 equals 1 only if Drug.1 is the drug I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and so on for the other IV variables. Otherwise, IV = 0 if the drug for that instance is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. in the example above, I want to calculate new variables:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA2_B2.1 = DateA2 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA2_B2.3 = DateA2 - DateB2.3
DateDiffA3_B3.3 = DateA3 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new to SPSS syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below, to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually, there is no need for the vector and the loop; all you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax will first put a zero in the count variable N2W. Then it will loop through all the dates, and only where the matching IV is 1 will it compare them to DateA, adding 1 to the count when the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IVs are 1, then instead of compute N2W=0. start the syntax with:
If any(1, IV.1 to IV.3) N2W=0.

Saving output in the form of an array

I want to save the output out as a column vector, but it is currently coming out as:
ans(:,:,1) =
0
ans(:,:,2) =
0
ans(:,:,3) =
0
ans(:,:,4) =
0
ans(:,:,5) =
0
ans(:,:,6) =
0
ans(:,:,7) =
-5.5511e-017
function [out] = myfun1(in)
in = importdata('X.dat');
l = length(in);
dt = 0.05;
l1 = (l-1)*dt + 1;
ts = timeseries(in, 1:0.05:l1);
ts1 = resample(ts, 1:1:l1);
out = ts1.data;
I think you might have a problem with the output as a whole. I assume that you want to resample the vector 'in' at sample points 1:1:l1. Resample should be called as:
resample(timeseriesVector, UpsampleValue, DownsampleValue)
This will generate a new timeseries vector at the rate of UpsampleValue/DownsampleValue.
I am sorry if I have misinterpreted your intentions. If you could provide some basic sample data it might be easier to debug.

How to plot a graph based on a txt file and split data by words?

A single row of my output txt file looks like:
1 open 0 heartbeat 0 closed 0
The gaps between values are a random mixture of different numbers of \t and space characters.
I wrote some code like:
import numpy as np
import matplotlib.pyplot as plt

with open("../testResults/star-6.txt") as f:
    data = f.read()

data = data.split('\n')
x = [row.split('HOW?')[0] for row in data]
y = [row.split('HOW?')[8] for row in data]

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_title("diagram")
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.plot(x, y, c='r', label='the data')
leg = ax1.legend()
plt.show()
This obviously does not work. Is there any way I can do a sort of row.split_by_word?
I'd appreciate any help! Thanks.
Use pandas. Given this data file (I call it "test.csv"):
1 open 0 heartbeat 8 closed 0
2 Open 1 h1artbeat 7 losed 10
3 oPen 0 he2rtbeat 6 cosed 100
4 opEn 1 hea3tbeat 5 clsed 10000
5 opeN 0 hear4beat 4 cloed 10000
6 OPen 1 heart5eat 3 closd 20000
7 OpEn 0 heartb6at 2 close 2000
8 OpeN 1 heartbe7t 1 osed 200
9 oPEn 0 heartbea8 0 lsed 20
You can do this:
import pandas as pd
df = pd.read_csv('test.csv', sep=r'\s+', header=None)
df.columns = ['x', 1, 2, 3, 4, 5, 'y']
x = df['x']
y = df['y']
The rest is the same.
You could also just do:
ax = df.plot(x='x', y='y', title='diagram')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
The simplest approach is to first replace the \t characters with spaces and then split on the spaces:
row.replace('\t',' ').split()
otherwise (e.g. if you have more types of delimiter or very long rows) using re might be better:
re.split(r'\s+', row)
Obviously you need to do import re first.
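For concreteness, here is a quick check of both approaches on a string modeled on the question's sample row (the exact mix of tabs and spaces below is made up for illustration):

import re

row = "1 open\t  0 \theartbeat   0\tclosed  0"  # arbitrary mix of tabs and spaces

print(row.replace('\t', ' ').split())
# ['1', 'open', '0', 'heartbeat', '0', 'closed', '0']

print(re.split(r'\s+', row.strip()))
# ['1', 'open', '0', 'heartbeat', '0', 'closed', '0']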
It's hard to decide which one is the best answer, since @TheBlackCat gives a more specific answer that looks much simpler than using matplotlib directly. I think his answer will be better for beginners like me who come across this question in the future.
However, based on @hitzg's suggestion, I worked out my source code to share.
import numpy as np
import matplotlib.pyplot as plt

filelist = ["mesh-10", "line-10", "ring-10", "star-10", "tree-10", "tri-graph-10"]

fig = plt.figure()
for filename in filelist:
    with open("../testResults/%s.log" % filename) as f:
        data = f.read()
    data = data.replace('\t', ' ')
    rows = data.split('\n')
    rows = [row.split() for row in rows]
    x_arr = []
    y_arr = []
    for row in rows:
        if row:  # important: sanitize null rows -> they cause errors
            x = float(row[0]) / 10
            y = float(row[8])
            x_arr.append(x)
            y_arr.append(y)
    ax1 = fig.add_subplot(111)
    ax1.set_title("result of all topologies")
    ax1.set_xlabel('time in sec')
    ax1.set_xlim([0, 30])  # set x axis limit
    ax1.set_ylabel('num of mydist msg')
    ax1.plot(x_arr, y_arr, c=np.random.rand(4), label=filename)
    leg = ax1.legend()

plt.show()
Hope it helps. Many thanks to both @TheBlackCat and @hitzg, and best wishes to anyone who is new to matplotlib and trying to find an answer to this :)
