How to plot a graph based on a txt file and split data by words?

A single row of my output txt file looks like:
1 open 0 heartbeat 0 closed 0
The fields are separated by a random mixture of tabs (\t) and spaces.
I wrote some code like this:
import numpy as np
import matplotlib.pyplot as plt
with open("../testResults/star-6.txt") as f:
    data = f.read()
data = data.split('\n')
x = [row.split('HOW?')[0] for row in data]
y = [row.split('HOW?')[8] for row in data]
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_title("diagram")
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.plot(x,y, c='r', label='the data')
leg = ax1.legend()
plt.show()
This obviously does not work. Is there any way I could do something like row.split_by_word?
I would appreciate any help. Thanks!

Use pandas. Given this data file (I call it "test.csv"):
1 open 0 heartbeat 8 closed 0
2 Open 1 h1artbeat 7 losed 10
3 oPen 0 he2rtbeat 6 cosed 100
4 opEn 1 hea3tbeat 5 clsed 10000
5 opeN 0 hear4beat 4 cloed 10000
6 OPen 1 heart5eat 3 closd 20000
7 OpEn 0 heartb6at 2 close 2000
8 OpeN 1 heartbe7t 1 osed 200
9 oPEn 0 heartbea8 0 lsed 20
You can do this:
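import pandas as pd
# sep=r'\s+' splits on any run of spaces/tabs; header=None means the file has no header row
df = pd.read_csv('test.csv', sep=r'\s+', header=None)
df.columns = ['x', 1, 2, 3, 4, 5, 'y']
x = df['x']
y = df['y']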
The rest is the same.
You could also just do:
ax = df.plot(x='x', y='y', title='diagram')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
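As a quick check of the parsing step alone, here is a minimal sketch that feeds two rows with deliberately mixed tabs and spaces through read_csv. The in-memory io.StringIO buffer and the exact whitespace in the sample rows are assumptions made for illustration; they stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv: fields separated by a mix of tabs and spaces.
raw = io.StringIO(
    "1 \topen\t0  heartbeat\t\t8 closed   0\n"
    "2\tOpen  1 h1artbeat 7\tlosed \t10\n"
)

# A regex separator of one-or-more whitespace characters handles both delimiters.
df = pd.read_csv(raw, sep=r"\s+", header=None)
df.columns = ["x", 1, 2, 3, 4, 5, "y"]

print(df[["x", "y"]])
```

Every row ends up with the same seven columns regardless of how the whitespace was mixed, so the x and y columns can be plotted directly.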

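The simplest approach is to first replace the \t characters with spaces and then split on whitespace (split() with no argument already treats tabs and spaces alike, so the replace is optional):
row.replace('\t',' ').split()
Otherwise (e.g. if you have more types of delimiter or very long rows) using re might be better. Note that the pattern must be \s+ rather than \s*, because \s* also matches the empty string:
re.split(r'\s+', row.strip())
Obviously you need to import re first.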
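The two variants above can be compared on a single row. This is a minimal sketch; the sample row and its whitespace are made up for illustration, following the format shown in the question.

```python
import re

# Hypothetical sample row mixing tabs and runs of spaces.
row = "1 \t open\t\t0   heartbeat \t 8  closed\t0"

# str.split() with no argument splits on any run of whitespace.
fields_split = row.split()

# Regex equivalent; strip() avoids empty strings at the ends.
fields_re = re.split(r"\s+", row.strip())

# Pick the first field as x and the fifth as y, converting to numbers.
x = int(fields_split[0])
y = int(fields_split[4])

print(fields_split)
print(x, y)
```

Both calls return the same list of seven fields, so for plain whitespace the built-in split() is enough and re is only needed for other delimiters.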

It's hard to decide which is the best answer. @TheBlackCat gives a more specific answer that looks much simpler than using matplotlib directly, so I think it will be better for beginners like me who find this question in the future.
However, based on @hitzg's suggestion I worked out my own code, which I am sharing here.
import numpy as np
import matplotlib.pyplot as plt
filelist = ["mesh-10","line-10","ring-10","star-10","tree-10","tri-graph-10"]
fig = plt.figure()
# create the axes once; every file is drawn on the same plot
ax1 = fig.add_subplot(111)
ax1.set_title("result of all topologies")
ax1.set_xlabel('time in sec')
ax1.set_xlim([0,30]) # set x axis limits
ax1.set_ylabel('num of mydist msg')
for filename in filelist:
    with open("../testResults/%s.log" % filename) as f:
        data = f.read()
    data = data.replace('\t',' ')
    rows = data.split('\n')
    rows = [row.split() for row in rows]
    x_arr = []
    y_arr = []
    for row in rows:
        if row: # important: skip empty rows, which would otherwise cause an error
            x = float(row[0])/10
            y = float(row[8])
            x_arr.append(x)
            y_arr.append(y)
    # random RGB color for each file
    ax1.plot(x_arr, y_arr, c=np.random.rand(3), label=filename)
leg = ax1.legend()
plt.show()
Hope it helps. Many thanks to both @TheBlackCat and @hitzg, and best wishes to people who are new to matplotlib and trying to find an answer to this :)

Related

How can I make decision for exactly one data set using ID3 decision tree

I'm implementing a program that asks users for their symptoms (whether they have a fever, a cough, or breathing issues) to check if they need a COVID test.
I implemented my ID3 decision tree and trained it on a dataset from a csv file.
Now I want the program to prompt the user to enter their symptoms and then give them some information.
My code is attached below. When I run it, the following error message shows up; I think it is because I only have one data row in my txt file:
pandas.errors.EmptyDataError: No columns to parse from file
How can I fix it, or is there a better way to make a decision for just one data row?
Thank you!
fever = input("Do you have a fever? (Yes or No) ")
cough = input("Do you cough? (Yes or No) ")
breathing_issue = input("Do you have short breating or other breathing issues? (Yes or No) ")
infected = "Yes"
test_sample = fever + "," + cough + "," + breathing_issue + "," +infected
f = open("test.txt", "w")
f.write(test_sample)
# convert to .csv
test_df = pd.read_csv(r'/Users/xxxx/xxxx/xxxx/test.txt', header=None, delim_whitespace=True)
train_df.columns = ['fever', 'cough', 'breating-issue', 'infected']
pd.set_option("display.max_columns", 500) # Load all columns

This occurs because lines 7-9 read an empty data frame. Here is a minimal reproducible example demonstrating the problem:
import pandas as pd
with open("test.txt", "w") as _fh:
    _fh.write("yes,no,yes,no")
df = pd.read_csv("test.txt")
print(df)
Output:
Empty DataFrame
Columns: [yes, no, yes.1, no.1]
Index: []
To get a nonempty DataFrame, either the columns need names or pd.read_csv needs to be called with optional argument header=None. Here is a version where column names are written:
import pandas as pd
with open("test.txt", "w") as _fh:
    _fh.write("fever,cough,breathing_issues,infected\n")
    _fh.write("yes,no,yes,no")
df = pd.read_csv("test.txt")
print(df)
Output:
  fever cough breathing_issues infected
0   yes    no              yes       no
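The other option mentioned above, header=None, can be sketched as follows. The io.StringIO buffer is an assumption used here in place of test.txt so the snippet is self-contained, and the column names passed through names= simply follow the question's fields.

```python
import io
import pandas as pd

# Stand-in for test.txt: a single comma-separated data row and no header row.
raw = io.StringIO("yes,no,yes,no")

# header=None tells pandas that the first line is data, not column names.
# names= assigns the column labels in the same call.
df = pd.read_csv(
    raw,
    header=None,
    names=["fever", "cough", "breathing_issues", "infected"],
)

print(df)
```

With this call the single row is kept as data, so the file itself does not need a header line.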

Logistic Regression Coefficient

I need some help on Logistic Regression.
Below is my data:
ID         | Mach_1 | Mach_2 | Mach_3 | Mach_4 | Mach_5 | ..Mach300 | Rejected Unit (%) | Yield(%)
127189.11  | 1      | 0      | 1      | 1      | 1      | 0         | 0.23              | 98.0%
178390.11  | 0      | 0      | 0      | 1      | 0      | 0         | 0.10              | 90.0%
902817.11  | 1      | 0      | 1      | 0      | 1      | 0         | 0.60              | 94.0%
DSK1201.11 | 1      | 0      | 0      | 0      | 1      | 0         | 0.02              | 99.98%
I have about 300 machine columns and 2K rows. For each machine, I want to estimate how much it contributes to the rejected unit percentage, so that I can identify which machine is responsible for the rejected units.
I have written some code, but I get errors that I don't understand and don't know how to solve.
Below is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('Data.csv')
#Convert ID into numerical
le = LabelEncoder()
df[df.columns[0]] = le.fit_transform(df.iloc[:,0])
#Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']
#Split data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
#Get the coefficient for each features column
import statsmodels.api as sm
model = sm.Logit(y_train,X_train)
res = model.fit()
print(res.summary())
This was my first version of the code, and it produced this error:
ValueError: endog must be in the unit interval
I then scaled my y (target variable), but got another error that I don't understand and don't know how to solve.
This is my latest code after scaling the data:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('Data.csv')
#Convert ID into numerical
le = LabelEncoder()
df[df.columns[0]] = le.fit_transform(df.iloc[:,0])
#Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']
#scale target variable
from sklearn.preprocessing import MinMaxScaler
y_reshape = y.values.reshape(-1,1)
scaler = MinMaxScaler()
y_scale = scaler.fit_transform(y_reshape)
#change the numpy array of y_scale into dataframe
y = pd.DataFrame(y_scale)
#Split data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
#Get the coefficient for each features column
import statsmodels.api as sm
model = sm.Logit(y_train,X_train)
res = model.fit()
print(res.summary())
Then I get the error:
Can anyone help me with this?

How to create pandas dataframes with more than 2 dimensions?

I want to be able to create n-dimensional dataframes. I've heard of a method for 3D dataframes using panels in pandas but, if possible, I would like to extend the dimensions past 3 dims by combining different datasets into a super dataframe
I tried this but I cannot figure out how to use these methods with my test dataset ->
Constructing 3D Pandas DataFrame
Also, this did not help for my case -> Pandas Dataframe or Panel to 3d numpy array
I made a random test dataset with arbitrary axis data trying to mimic a real situation; there are 3 axes (i.e. patients, years, and samples). I tried adding a bunch of dataframes to a list and then making a dataframe from that list, but it didn't work :( I even tried a panel as in the 2nd link above, but I couldn't get that to work either.
Does anybody know how to create a N-dimensional pandas dataframe w/ labels?
The first way I tried:
#Reproducibility
np.random.seed(1618033)
#Set 3 axis labels/dims
axis_1 = np.arange(2000,2010) #Years
axis_2 = np.arange(0,20) #Samples
axis_3 = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((axis_1.size, axis_2.size, axis_3.size)) #(10, 20, 3)
#Create empty list to store 2D dataframes (axis_2=rows, axis_3=columns) along axis_1
list_of_dataframes=[]
#Iterate through all of the year indices
for i in range(axis_1.size):
    #Create dataframe of (samples, patients)
    DF_slice = pd.DataFrame(A_3D[i,:,:],index=axis_2,columns=axis_3)
    list_of_dataframes.append(DF_slice)
# print(DF_slice) #preview of the 2D dataframes "slice" of the 3D array
# patient_0 patient_1 patient_2
# 0 0.727753 0.154701 0.205916
# 1 0.796355 0.597207 0.897153
# 2 0.603955 0.469707 0.580368
# 3 0.365432 0.852758 0.293725
# 4 0.906906 0.355509 0.994513
# 5 0.576911 0.336848 0.265967
# ...
# 19 0.583495 0.400417 0.020099
# DF_3D = pd.DataFrame(list_of_dataframes,index=axis_2, columns=axis_1)
# Error
# Shape of passed values is (1, 10), indices imply (10, 20)
2nd way I tried:
DF = pd.DataFrame(axis_3,columns=axis_2)
#Error:
#Shape of passed values is (1, 3), indices imply (20, 3)
# p={}
# for i in axis_1:
# p[i]=DF
# panel= pd.Panel(p)
I could do something like this I guess, but I really like pandas and would rather use one of their methods if one exists:
#Set data for query
query_year = 2007
query_sample = 15
query_patient = "patient_1"
#Index based on query
A_3D[
    (axis_1 == query_year).argmax(),
    (axis_2 == query_sample).argmax(),
    (axis_3 == query_patient).argmax()
]
#0.1231212416981845
It would be awesome to access the data in this way:
DF_3D[query_year][query_sample][query_patient]
#Where DF_3D[query_year] would give a list of 2D arrays (row=sample, col=patient)
# DF_3D[query_year][query_sample] would give a 1D vector/list of patient data for a particular year, of a particular sample.
# and DF_3D[query_year][query_sample][query_patient] would be a particular sample of a particular patient of a particular year

Rather than using an n-dimensional Panel, you are probably better off using a two dimensional representation of data, but using MultiIndexes for the index, column or both.
For example:
import numpy as np
import pandas as pd
np.random.seed(1618033)
#Set 3 axis labels/dims
years = np.arange(2000,2010) #Years
samples = np.arange(0,20) #Samples
patients = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((years.size, samples.size, len(patients))) #(10, 20, 3)
# Create the MultiIndex from years, samples and patients.
midx = pd.MultiIndex.from_product([years, samples, patients])
# Create sample data for each patient, and add the MultiIndex.
patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index = midx)
>>> patient_data.head()
                         0         1         2
2000 0 patient_0 -0.128005  0.371413 -0.078591
       patient_1 -0.378728 -2.003226 -0.024424
       patient_2  1.339083  0.408708  1.724094
     1 patient_0 -0.997879 -0.251789 -0.976275
       patient_1  0.131380 -0.901092  1.456144
Once you have data in this form, it is relatively easy to juggle it around. For example:
>>> patient_data.unstack(level=0).head() # Years.
0 ... 2
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ... 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
0 patient_0 -0.128005 0.051558 1.251120 0.666061 -1.048103 0.259231 1.535370 0.156281 -0.609149 0.360219 ... -0.078591 -2.305314 -2.253770 0.865997 0.458720 1.479144 -0.214834 -0.791904 0.800452 0.235016
patient_1 -0.378728 -0.117470 -0.306892 0.810256 2.702960 -0.748132 -1.449984 -0.195038 1.151445 0.301487 ... -0.024424 0.114843 0.143700 1.732072 0.602326 1.465946 -1.215020 0.648420 0.844932 -1.261558
patient_2 1.339083 -0.915771 0.246077 0.820608 -0.935617 -0.449514 -1.105256 -0.051772 -0.671971 0.213349 ... 1.724094 0.835418 0.000819 1.149556 -0.318513 -0.450519 -0.694412 -1.535343 1.035295 0.627757
1 patient_0 -0.997879 -0.242597 1.028464 2.093807 1.380361 0.691210 -2.420800 1.593001 0.925579 0.540447 ... -0.976275 1.928454 -0.626332 -0.049824 -0.912860 0.225834 0.277991 0.326982 -0.520260 0.788685
patient_1 0.131380 0.398155 -1.671873 -1.329554 -0.298208 -0.525148 0.897745 -0.125233 -0.450068 -0.688240 ... 1.456144 -0.503815 -1.329334 0.475751 -0.201466 0.604806 -0.640869 -1.381123 0.524899 0.041983
In order to select the data, please refer to the docs for MultiIndexing.
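As a rough illustration of selection on a MultiIndex, here is a minimal sketch that stores the 3D array from the question as a Series with three index levels. The level names ("year", "sample", "patient") and the use of a Series rather than a DataFrame are assumptions made for this example, not part of the answer above.

```python
import numpy as np
import pandas as pd

years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(0, 3)])

rng = np.random.default_rng(0)
A_3D = rng.random((years.size, samples.size, patients.size))

# from_product enumerates labels in the same C order that ravel() uses.
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)
s = pd.Series(A_3D.ravel(), index=midx)

# A full key returns a single value.
value = s.loc[(2007, 15, "patient_1")]

# A partial key returns everything below that level.
one_year = s.loc[2007]

# xs() selects on an inner level by name.
one_patient = s.xs("patient_1", level="patient")
```

This gives lookups close to the DF_3D[query_year][query_sample][query_patient] access asked for in the question, written as s.loc[(year, sample, patient)].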

You should consider using xarray instead. From their documentation:
Panel, pandas’ data structure for 3D arrays, was always a second class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas removed Panel in favor of directing users who use multi-dimensional arrays to xarray.

An alternative approach (to Alexander's) that is derived from the structure of the input data is:
np.random.seed(1618033)
#Set 3 axis labels/dims
years = np.arange(2000,2010) #Years
samples = np.arange(0,20) #Samples
patients = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((years.size, samples.size, len(patients))) #(10, 20, 3)
# Reshape data to 2 dimensions
maj_dim = 1
for dim in A_3D.shape[:-1]:
    maj_dim = maj_dim*dim
new_dims = (maj_dim, A_3D.shape[-1])
A_3D = A_3D.reshape(new_dims)
# Create the MultiIndex from years, samples and patients.
midx = pd.MultiIndex.from_product([years, samples])
# Note that Cartesian product order is the same as the
# C-order used by default in ``reshape``.
# Create sample data for each patient, and add the MultiIndex.
patient_data = pd.DataFrame(data = A_3D,
                            index = midx,
                            columns = patients)
>>> patient_data.head()
        patient_0  patient_1  patient_2
2000 0   0.727753   0.154701   0.205916
     1   0.796355   0.597207   0.897153
     2   0.603955   0.469707   0.580368
     3   0.365432   0.852758   0.293725
     4   0.906906   0.355509   0.994513
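A compact variant of the same idea, with a lookup added, might look like the sketch below. It lets reshape(-1, n) infer the product of the leading dimensions instead of computing it in a loop; the index level names ("year", "sample") and the A_2D variable are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(0, 3)])
A_3D = np.random.random((years.size, samples.size, len(patients)))

# -1 asks NumPy to infer the row count (years * samples) from the array size.
A_2D = A_3D.reshape(-1, A_3D.shape[-1])

# Rows are labelled by (year, sample); columns are the patients.
midx = pd.MultiIndex.from_product([years, samples], names=["year", "sample"])
patient_data = pd.DataFrame(A_2D, index=midx, columns=patients)

# Single value for one year, sample and patient.
value = patient_data.loc[(2007, 15), "patient_1"]

# All samples and patients for one year, as a 2D frame.
year_2007 = patient_data.loc[2007]
```

The lookup returns the same number as indexing the original array at position [7, 15, 1], which is a simple way to check that the reshape kept the labels aligned.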

Gnuplot: import x-axis from file

I have two files 'results.dat' and 'grid.dat'.
Each row of results.dat contains a different data set of y values.
1 325.5 875.4 658.7 365.5
2 587.5 987.5 478.6 658.5
3 987.1 542.6 986.2 458.7
The grid.dat contains the corresponding x values.
1 100.0 200.0 300.0 400.0
How can I use gnuplot to plot grid.dat as the x values and a specific line of results.dat as the corresponding y values? E.g. line 3:
1 100.0 987.1
2 200.0 542.6
3 300.0 986.2
4 400.0 458.7
Thanks in advance.

That's quite similar to the recent question "Gnuplot: plotting the maximum of two files". In your case it is also not possible to do it with gnuplot only.
You need an external tool to combine the two files on-the-fly, e.g. with the following python script (any other tool would also do):
""" selectrow.py: Select a row from 'results.dat' and merge with 'grid.dat'."""
import numpy as np
import sys
line = int(sys.argv[1])
A = np.loadtxt('grid.dat')
# ndmin=2 keeps the result 2-D even when only one row is left after skipping
B = np.loadtxt('results.dat', skiprows=(line-1), ndmin=2)[0]
np.savetxt(sys.stdout, np.c_[A, B], delimiter='\t')
And then plot the third line of results.dat with
plot '< python selectrow.py 3' w l

How do I make this specific code run faster in Matlab?

I have an array with a set of chronological serial numbers and another source array with random serial numbers associated with a numeric value. The code creates a new cell array in MATLAB with the perfectly chronological serial numbers in one column and in the next column it inserts the associated numeric value if the serial numbers match in both original source arrays. If they don't the code simply copies the previous associated value until there is a new match.
j = 1;
A = {random{1:end,1}};
B = cell2mat(A);
value = random{1,2};
data = cell(length(serial), 1);
data(:,1) = serial(:,1);
h = waitbar(0,'Please Wait...');
steps = length(serial);
for k = 1:length(serial)
    [row1, col1, vec1] = find(B == serial{k,1});
    tf1 = isempty(vec1);
    if (tf1 == 0)
        value = random{col1,2}; % update the carried value on a match
        data(j,2) = num2cell(value);
        j = j + 1;
    else
        data(j,2) = num2cell(value);
        j = j + 1;
    end
    waitbar(k/steps,h,['Please Wait... ' num2str(k/steps*100) ' %'])
end
close(h);
Right now, the run-time for the code is approximately 4 hours. I would like to make this code run faster. Please suggest any methods to do so.
UPDATE
source input (serial)
1
2
3
4
5
6
7
source input (random)
1 100
2 105
4 106
7 107
desired output (data)
SR No Value
1 100
2 105
3 105
4 106
5 106
6 106
7 107

First, run the MATLAB profiler (see 'doc profile') to see where the bulk of the execution time is occurring.
Second, don't update the waitbar on every iteration, particularly if serial contains a large (> 100) number of elements.
Do something like:
if (mod(k, 100)==0) % update on every 100th iteration
    waitbar(k/steps,h,['Please Wait... ' num2str(k/steps*100) ' %'])
end

Some points:
Firstly it would help a lot if you gave us some sample input and output data.
Why do you initialize data as one column and then fill its second column in the loop? Rather, initialize it with 2 columns upfront: data = cell(length(serial), 2);
Is j ever different from k? They look identical to me, so you could index with k and drop both of the j = j + 1 lines.
tf1 = isempty(vec1); if (tf1 == 0)... is the same as the single line if (~isempty(vec1)), or even better, use if (isempty(vec1)) and swap the code in your if and else branches.
But I think you can probably find a fast vectorized solution if you provide some (short) sample input and output data.
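To make the vectorized idea concrete, here is a minimal sketch written in NumPy rather than MATLAB, using the sample data from the question's update. The variable names are hypothetical, and the approach (a sorted lookup plus binary search) is one possible technique, not something proposed in the answers above.

```python
import numpy as np

# Sample data from the question's update (hypothetical variable names).
serial = np.array([1, 2, 3, 4, 5, 6, 7])
random_serial = np.array([1, 2, 4, 7])
random_value = np.array([100, 105, 106, 107])

# Sort the lookup table by serial number so a binary search can be used.
order = np.argsort(random_serial)
sorted_serial = random_serial[order]
sorted_value = random_value[order]

# For each serial, find the last table entry whose serial is <= that serial.
idx = np.searchsorted(sorted_serial, serial, side="right") - 1

# Assumes the first serial has a match; otherwise idx would be -1 there.
data = np.column_stack([serial, sorted_value[idx]])

print(data)
```

The loop with find is replaced by a single binary search over all serial numbers, which carries the previous value forward wherever there is no exact match.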