How to convert string datatypes to float in numpy arrays in Python 3 - arrays

I have been using Python 2.7 for some time now and have recently switched to Python 3. I have already updated my code in several places, but the problem I currently have eludes me. What I am trying to do is load a dataset using np.loadtxt. Because this data also contains strings, I am importing the full array as strings. I want to do type conversions afterwards to convert some entries to float. This fails and I do not understand why. All I see is that in Python 3 all strings get the prefix 'b', and I have the feeling this has something to do with it, but I cannot find a concise answer. Code and error below.
filename = 'train.csv'
raw_data = open(filename, 'rb')
data = np.loadtxt(raw_data, delimiter=",", dtype = 'str')
dataset = data[1:,1:]
print(dataset)
original_data = dataset
test = float(dataset[0,0])
print(test)
Result
[["b'60'" "b'RL'" "b'65'" ..., "b'WD'" "b'Normal'" "b'208500'"]
["b'20'" "b'RL'" "b'80'" ..., "b'WD'" "b'Normal'" "b'181500'"]
["b'60'" "b'RL'" "b'68'" ..., "b'WD'" "b'Normal'" "b'223500'"]
...,
["b'70'" "b'RL'" "b'66'" ..., "b'WD'" "b'Normal'" "b'266500'"]
["b'20'" "b'RL'" "b'68'" ..., "b'WD'" "b'Normal'" "b'142125'"]
["b'20'" "b'RL'" "b'75'" ..., "b'WD'" "b'Normal'" "b'147500'"]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-c154945cd6f1> in <module>()
5 print(dataset)
6 original_data = dataset
----> 7 test = float(dataset[0,0])
8 print(test)
ValueError: could not convert string to float: "b'60'"

As suggested by dnalow, something goes wrong in the type conversion because I first open the file and then read from it. The solution is not to use open(filename, 'rb') together with np.loadtxt, but to pass the filename to np.genfromtxt. Code below.
filename = 'train.csv'
data = np.genfromtxt(filename, delimiter=",", dtype = 'str')
dataset = data[1:,1:]
print(dataset)
original_data = dataset
test = float(dataset[0,0])
print(test)
Result
[['60' 'RL' '65' ..., 'WD' 'Normal' '208500']
['20' 'RL' '80' ..., 'WD' 'Normal' '181500']
['60' 'RL' '68' ..., 'WD' 'Normal' '223500']
...,
['70' 'RL' '66' ..., 'WD' 'Normal' '266500']
['20' 'RL' '68' ..., 'WD' 'Normal' '142125']
['20' 'RL' '75' ..., 'WD' 'Normal' '147500']]
60.0
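The b prefix in the first result is what Python 3 shows for bytes objects: the file was opened in binary mode ('rb'), and calling str() on a bytes value yields its repr, prefix and quotes included, which float() cannot parse. The sketch below illustrates that effect and shows one way to convert several numeric columns at once with astype. The CSV text and its column names are made up for illustration and only mimic the shape of train.csv.

```python
import io
import numpy as np

# str() on a bytes object yields its repr, which float() cannot parse.
as_text = str(b"60")        # the text "b'60'", prefix and quotes included
decoded = b"60".decode()    # the text "60"

# Hypothetical stand-in for train.csv (made-up header and rows).
csv_text = "id,c1,c2,c3\n1,60,RL,65\n2,20,RL,80\n3,60,RL,68\n"
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=str)
dataset = data[1:, 1:]      # drop the header row and the id column

# Convert only the numeric columns; a value such as 'RL' would raise ValueError.
numeric = dataset[:, [0, 2]].astype(float)
print(numeric)
```

Selecting the numeric columns before calling astype avoids converting entries one at a time with float().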

Related

Python : Replace a column in a dataframe by datetime values

I'm trying to replace one column of a 4-column array with datetime values that I have processed. The problem is that it's difficult to keep the same form when moving between the different formats (DataFrame, array, ...).
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:,0,0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
time = ds.variables["time"][:]
time = pd.to_datetime(ds.variables["time"][:],origin=pd.Timestamp('1850-01-01'),unit='D')
#np.datetime64(ds.variables["time"][:],'D')
x2 = pd.DataFrame(np.zeros((len(dataw),4), float))
x = np.zeros((len(dataw),4), float)
x[:,0] = time
x[:,1] = long
x[:,2] = lat[:]
x[:,3] = dataw[:]*86400
x=pd.DataFrame(x)
x[:,0] = pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
If I put the converted dates directly into the array, the result looks like: 1.32542e+18
I tried
time = ds.variables["time"][:]
and include it in the array, and then use
x[:,0]=pd.to_datetime(x[:,0],origin=pd.Timestamp('1850-01-01'),unit='D')
I get the error:
TypeError: unhashable type: 'slice'
I tried also directly put:
time=pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
x[:,0] = time[:]
TypeError: unhashable type: 'slice'
Try this instead:
import numpy as np
import pandas as pd
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:, 0, 0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
time = ds.variables["time"][:]
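# np.datetime64() converts a single value, so convert the whole time axis with pandas
time = pd.to_datetime(np.asarray(time), origin=pd.Timestamp('1850-01-01'), unit='D')
# A NumPy array holds a single dtype, so build the DataFrame column by column to keep
# datetimes in "Time" and floats elsewhere (assumes all four columns have equal length)
df = pd.DataFrame({"Time": time, "Longitude": long, "Latitude": lat, "Data": dataw * 86400})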
xarray makes the NetCDF -> pandas workflow quite straightforward:
import xarray as xr
ds = xr.open_dataset('file.nc', engine='netcdf4')
# to_dataframe() handles datasets with several dimensions (time, lat, lon)
df = ds.to_dataframe()
Presuming your time variable follows the CF conventions, xarray will automatically decode it into datetime objects.
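A value such as 1.32542e+18 is consistent with datetimes being stored as nanosecond counts once they are written into a float array, and the "unhashable type: 'slice'" error comes from indexing a DataFrame with x[:, 0], which is NumPy syntax. A minimal sketch of one way around both follows, assuming four equally long columns. The arrays here are made-up stand-ins for the NetCDF variables.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the NetCDF variables (days since 1850-01-01).
days = np.array([0.0, 1.0, 2.0])
lon = np.array([10.0, 10.0, 10.0])
lat = np.array([45.0, 45.0, 45.0])
pr = np.array([1e-5, 2e-5, 0.0])

time = pd.to_datetime(days, origin=pd.Timestamp("1850-01-01"), unit="D")

# One dtype per column: datetimes stay datetimes, the rest stay floats.
df = pd.DataFrame(
    {"Time": time, "Longitude": lon, "Latitude": lat, "Data": pr * 86400}
)

# Positional access on a DataFrame goes through .iloc rather than df[:, 0].
first_column = df.iloc[:, 0]
print(df.dtypes)
```

Building the DataFrame from a dict of columns avoids the intermediate float array entirely, so no conversion back to datetimes is needed afterwards.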

Calculate radius r = x^2 + y^2 using array values

I would like to use my calculated y[0] (x-values) and y[1] (y-values) from sol_3 using the solve_ivp to find the radius(r) and plot r(t).
How can I calculate r using x and y values from sol_r? I keep getting TypeError: only size-1 arrays can be converted to Python scalars.
It seems like I am having issues with data type.
Please find my code below:
timing = np.array([0,2*math.pi]) #t0 = 0 tf=2pi
sol_3 = solve_ivp(fun = motion, t_span=[-math.pi, 2*math.pi], y0 = initial, method='RK23')
x = (sol_3.y[0, :])
y = (sol_3.y[1, :])
r = math.sqrt(sol_3.y[1,:]**2+sol_3.y[0,:]**2)
The array of the sol_3 looks like this:
message: 'The solver successfully reached the end of the integration interval.'
nfev: 599
njev: 0
nlu: 0
sol: None
status: 0
success: True
t: array([-3.14159265, -3.14155739, -3.14120473, -3.13767817, -3.12223578,
-3.09595606, -3.0602884 , -3.01493539, -2.96220999, -2.90868487,
-2.86527118, -2.82204945, -2.76814683, -2.69499434, -2.61108567,
-2.51943819, -2.44000487, -2.36057156, -2.28852194, -2.20054302,
-2.10648128, -2.03001656, -1.97695204, -1.92388751, -1.87465566,
-1.82156056, -1.76673945, -1.73011416, -1.70585555, -1.68159695,
-1.64979671, -1.60847127, -1.55783888, -1.50388809, -1.45063319,
-1.40126952, -1.3488944 , -1.27797028, -1.19501887, -1.1039372 ,
-1.02086808, -0.95113058, -0.88636741, -0.80259334, -0.70932861,
-0.61584263, -0.55777683, -0.50936554, -0.46344206, -0.4113168 ,
-0.35663683, -0.31427459, -0.28317047, -0.25206634, -0.22135512,
-0.18102468, -0.13108896, -0.07689042, -0.02318517, 0.02738243,
0.07818198, 0.14701599, 0.22918967, 0.31978884, 0.41431515,
0.48493772, 0.55000025, 0.63412425, 0.72775963, 0.82162403,
0.87984573, 0.92849841, 0.97470164, 1.02711062, 1.0820909 ,
1.12462382, 1.15583674, 1.18704967, 1.21796912, 1.25856295,
1.30880593, 1.36330716, 1.41729713, 1.46809564, 1.51919004,
1.58841181, 1.67102044, 1.76206092, 1.85696013, 1.92801041,

Data arrays must have the same length, and match time discretization in dynamic problems error in GEKKO

I want to find the value of the parameter m that minimizes my variable x subject to a system of differential equations. I have the following code
import numpy as np
from gekko import GEKKO
def run_model_m(days, population, case, k_val, b_val, u0_val, sigma_val, Kmax0, a_val, c_val):
    list_x =[]
    list_u =[]
    list_Kmax =[]
    for i in range(len(days)):
        list_xi=[]
        list_ui=[]
        list_Ki=[]
        for j in range(len(days[i])):
            #try:
            m = GEKKO(remote=False)
            #m.time= days[i][j]
            eval = np.linspace(days[i][j][0], days[i][j][-1], 100, endpoint=True)
            m.time = eval
            x_data= population[i][j]
            variable= np.linspace(population[i][j][0], population[i][j][-1], 100, endpoint=True)
            x = m.Var(value=population[i][j][0], lb=0)
            sigma= m.Param(sigma_val)
            d = m.Param(c_val)
            k = m.Param(k_val)
            b = m.Param(b_val)
            r = m.Param(a_val)
            step = np.ones(len(eval))
            step= 0.2*step
            step[0]=1
            m_param = m.CV(value=1, lb=0, ub=1, integer=True); m_param.STATUS=1
            u = m.Var(value=u0_val, lb=0, ub=1)
            #m.free(u)
            a = m.Param(a_val)
            c= m.Param(c_val)
            Kmax= m.Param(Kmax0)
            if case == 'case0':
                m.Equations([x.dt()== x*(r*(1-x/(Kmax))-m_param/(k+b*u)-d), u.dt()== sigma*(m_param*b/((k+b*u)**2))])
            elif case == 'case4':
                m.Equations([x.dt()== x*(r*(1-u**2)*(1-x/(Kmax))-m_param/(k+b*u)-d), u.dt() == sigma*(-2*u*r*(1-x/(Kmax))+(b*m_param)/(b*u+k)**2)])
            p = np.zeros(len(eval))
            p[-1] = 1.0
            final = m.Param(value=p)
            m.Obj(x)
            m.options.IMODE = 6
            m.options.MAX_ITER=15000
            m.options.SOLVER=1
            # optimize
            m.solve(disp=False, GUI=False)
            #m.open_folder(dataset_path+'inf')
            list_xi.append(x.value)
            list_ui.append(u.value)
            list_Ki.append(m_param.value)
        list_x.append(list_xi)
        list_Kmax.append(list_Ki)
        list_u.append(list_ui)
    return list_x, list_u, list_Kmax, m.options.OBJFCNVAL
scaled_days[i][j] =[-7.0, 42.0, 83.0, 125.0, 167.0, 217.0, 258.0, 300.0, 342.0]
scaled_pop[i][j] = [0.01762491277346285, 0.020592540360308997, 0.017870838266697213, 0.01690069378982034,0.015512320147187675,0.01506701796298272,0.014096420738841563,0.013991224004743027,0.010543380664478205]
k0,b0,group, case0, u0, sigma0, K0, a0, c0 = (100, 20, 'Size3, Inc', 'case0', 0.1, 0.05, 2, 0, 0.01)
list_x2, list_u2, list_Kmax2,final =run_model_m(days=[[scaled_days[i][j]]], population=
[[scaled_pop[i][j]]],case=case1, k_val=list_b1[i0][0], b_val=b1, u0_val=list_u1[i0][j0],
sigma_val=sigma1, Kmax0=K1, a_val=list_Kmax1[0][0], c_val=c1)
I get the error "Data arrays must have the same length, and match time discretization in dynamic problems", but I don't understand why. I have tried making x and m_param arrays, with x = m.Var, m_param = m.MV, ..., but I still get the same error, even if they are all arrays of the same length. Is this the right way to find the solution of the minimization problem?
I think the error was just that in run_model_m I was passing a list as u0_val and it didn't have the same dimensions as m.time. So it should be u0_val=list_u1[0][0][0]
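The fix above amounts to passing a scalar instead of a list whose length differs from m.time. A small pre-check along those lines can be sketched in plain NumPy. The helper name check_against_time is hypothetical and not part of GEKKO; it only reports which inputs are list-like with a length that does not match the time grid.

```python
import numpy as np

def check_against_time(time, **values):
    """Return the names of list-like values whose length differs from len(time)."""
    n = len(time)
    mismatched = []
    for name, value in values.items():
        if np.ndim(value) == 0:
            continue  # scalars are fine as initial values
        if len(value) != n:
            mismatched.append(name)
    return mismatched

time = np.linspace(-7.0, 342.0, 100)   # same grid shape as m.time in the question
u0_list = [0.1, 0.2, 0.3]              # made-up list passed by mistake

print(check_against_time(time, u0_val=u0_list, sigma_val=0.05))     # ['u0_val']
print(check_against_time(time, u0_val=u0_list[0], sigma_val=0.05))  # []
```

Running such a check on the arguments before constructing m.Var or m.Param objects points at the offending input directly.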

Importing data from multiple .csv files into single DataFrame

I'm having trouble getting data from several .csv files into a single array. I can get all of the data from the .csv files fine, I just can't get everything into a simple numpy array. The name of each .csv file is important to me so in the end I'd like to have a Pandas DataFrame with the columns labeled by the initial name of the .csv file.
import glob
import numpy as np
import pandas as pd
files = glob.glob("*.csv")
temp_dict = {}
wind_dict = {}
for file in files:
    data = pd.read_csv(file)
    temp_dict[file[:-4]] = data['HLY-TEMP-NORMAL'].values
    wind_dict[file[:-4]] = data['HLY-WIND-AVGSPD'].values
temp = []
wind = []
name = []
for word in temp_dict:
    name.append(word)
    temp.append(temp_dict[word])
for word in wind_dict:
    wind.append(wind_dict[word])
temp = np.array(temp)
wind = np.array(wind)
When I print temp or wind I get something like this:
[array([ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9])
array([ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2])
array([ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6])
...
array([ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7])]
when what I really want is:
[[ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9]
[ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2]
[ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6]
...
[ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7]]
This does not work but is the goal of my code:
df = pd.DataFrame(temp, columns=name)
And when I try to use a Pandas DataFrame, each row is its own array, which isn't helpful because it thinks every row has only one element in it. I know the problem is with "array(...)"; I just don't know how to get rid of it. Thank you in advance for your time and consideration.
I think you can use:
files = glob.glob("*.csv")
#read each file to list of DataFrames
dfs = [pd.read_csv(fp) for fp in files]
#create names for each file
lst4 = [x[:-4] for x in files]
#create one big df with MultiIndex by files names
df = pd.concat(dfs, keys=lst4)
If you want separate DataFrames, replace the last line of the solution above with a reshape via unstack:
df = pd.concat(dfs, keys=lst4).unstack()
df_temp = df['HLY-TEMP-NORMAL']
df_wind = df['HLY-WIND-AVGSPD']
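For the stated goal of one column per file name, a dict of per-file arrays can also be passed to pd.DataFrame directly. The sketch below uses made-up station names and values instead of real files, and assumes (the question does not say) that the array-of-arrays output came from files with different row counts. Wrapping each array in pd.Series pads shorter columns with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical per-file data, keyed by file name without the .csv suffix.
temp_dict = {
    "station_a": np.array([32.1, 31.1, 30.3]),
    "station_b": np.array([17.3, 17.2, 17.2]),
    "station_c": np.array([41.8, 41.1]),  # one row short on purpose
}

# One column per file; pd.Series aligns on the row index and pads with NaN.
df_temp = pd.DataFrame(
    {name: pd.Series(values) for name, values in temp_dict.items()}
)

# A plain 2-D float array with one row per file, if that form is needed.
temp = df_temp.to_numpy().T
print(df_temp)
```

The dict keys become the column labels, so no separate name list has to be kept in sync with the data.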

TypeError: ufunc 'add' did not contain a loop

I use Anaconda and gdsCAD and get an error when all packages are installed correctly.
Like explained here: http://pythonhosted.org/gdsCAD/
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
My imports look like this (In the end I imported everything):
import numpy as np
from gdsCAD import *
import matplotlib.pyplot as plt
My example code looks like this:
something = core.Elements()
box=shapes.Box( (5,5),(1,5),0.5)
core.default_layer = 1
core.default_colors = 2
something.add(box)
something.show()
My error message looks like this:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-2f90b960c1c1> in <module>()
31 puffer_wafer = shapes.Circle((0.,0.), puffer_wafer_radius, puffer_line_thickness)
32 bp.add(puffer_wafer)
---> 33 bp.show()
34 wafer = shapes.Circle((0.,0.), wafer_radius, wafer_line_thickness)
35 bp.add(wafer)
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in _show(self)
80 ax.margins(0.1)
81
---> 82 artists=self.artist()
83 for a in artists:
84 a.set_transform(a.get_transform() + ax.transData)
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in artist(self, color)
952 art=[]
953 for p in self:
--> 954 art+=p.artist()
955 return art
956
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in artist(self, color)
475 poly = lines.buffer(self.width/2.)
476
--> 477 return [descartes.PolygonPatch(poly, lw=0, **self._layer_properties(self.layer))]
478
479
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in _layer_properties(layer)
103 # Default colors from previous versions
104 colors = ['k', 'r', 'g', 'b', 'c', 'm', 'y']
--> 105 colors += matplotlib.cm.gist_ncar(np.linspace(0.98, 0, 15))
106 color = colors[layer % len(colors)]
107 return {'color': color}
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
The gdsCAD has been a pain from shapely install to this plotting issue.
This issue occurs because a NumPy array, rather than a list, is added to the colors list. It can be solved by editing the following line in core.py
colors += matplotlib.cm.gist_ncar(np.linspace(0.98, 0, 15))
to
colors += list(matplotlib.cm.gist_ncar(np.linspace(0.98, 0, 15)))
If you don't know where core.py is located, just type:
from gdsCAD import *
core
This will give you the path of the core.py file. Good luck!
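The reason list() helps can be seen in isolation: when the right-hand side of += is a NumPy array, NumPy's elementwise addition takes over instead of list concatenation. With a list of color strings and an array of floats that addition has no matching loop, which matches the TypeError in the traceback. The sketch below uses small integer values, chosen only so that both variants run without error.

```python
import numpy as np

extra = np.array([3, 4])

# In-place add with an array on the right hands control to NumPy:
# the list is converted to an array and added elementwise.
colors = [1, 2]
colors += extra

# Converting to a list first gives ordinary list concatenation.
fixed = [1, 2]
fixed += list(extra)

print(type(colors).__name__, colors)
print(type(fixed).__name__, len(fixed))
```

After the first variant, colors is no longer a list at all, which is why the patched line wraps the colormap output in list().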
Well first, I'd ask that you please include actual code, as the 'example code' in the file is obviously different based on the traceback. When debugging, the details matter, and I need to be able to actually run the code.
You obviously have a data type problem. Chances are pretty good it's in the variables here:
puffer_wafer = shapes.Circle((0.,0.), puffer_wafer_radius, puffer_line_thickness)
I had the same error thrown when I was running a call to Pandas. I changed the data to str(data) and the code worked.
I don't know if this helps, as I am fairly new to this myself, but I had a similar error and found that it is due to a type casting issue, as suggested by the previous answer. I can't see from the example in the question exactly what you are trying to do. Below is a small example of my issue and solution. My code builds a simple Random Forest model using scikit-learn.
Here is an example that gives the error. It is caused by the third-to-last line, which concatenates the results to write to file.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation
Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean()) # replace the NA values with the mean of the descriptor
header = Data.columns.values # Use the column headers as the descriptor labels
Data.head()
test_name = "Test.csv"
npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X,y, random_state=0)
# Predictions results initialised
RFpredictions = []
RF = RandomForestRegressor(n_estimators = 10, max_features = 5, max_depth = 5, random_state=0)
RF.fit(XTrain, yTrain) # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain,yTrain))
RFpreds = RF.predict(XTest)
with open(test_name,'a') as fpred :
    lenpredictions = len(RFpreds)
    lentrue = yTest.shape[0]
    if lenpredictions == lentrue :
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0,lenpredictions) :
            fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
    else :
        print "ERROR - names, prediction and true value array size mismatch."
This leads to the following error:
Traceback (most recent call last):
File "min_example.py", line 40, in <module>
fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
The solution is to convert each variable with str() on the third-to-last line before writing to file. No other changes have been made to the code above.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation
Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean()) # replace the NA values with the mean of the descriptor
header = Data.columns.values # Use the column headers as the descriptor labels
Data.head()
test_name = "Test.csv"
npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X,y, random_state=0)
# Predictions results initialised
RFpredictions = []
RF = RandomForestRegressor(n_estimators = 10, max_features = 5, max_depth = 5, random_state=0)
RF.fit(XTrain, yTrain) # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain,yTrain))
RFpreds = RF.predict(XTest)
with open(test_name,'a') as fpred :
    lenpredictions = len(RFpreds)
    lentrue = yTest.shape[0]
    if lenpredictions == lentrue :
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0,lenpredictions) :
            fpred.write(str(RFpreds[i])+",,"+str(yTest[i])+",\n")
    else :
        print "ERROR - names, prediction and true value array size mismatch."
These examples are from a larger code so I hope the examples are clear enough.
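The str() fix can be reduced to a few lines that do not depend on scikit-learn. In this sketch the prediction and true-value arrays are made up, and an in-memory io.StringIO buffer stands in for the output file. It is written for Python 3, unlike the Python 2 code above.

```python
import io
import numpy as np

preds = np.array([1.25, 2.5])   # hypothetical predictions
truth = np.array([1.0, 3.0])    # hypothetical true values

out = io.StringIO()             # stands in for the open output file
out.write("Prediction,True Value\n")
for p, t in zip(preds, truth):
    # str() turns each NumPy scalar into text before concatenation,
    # so NumPy never tries to "add" a number and a string.
    out.write(str(p) + "," + str(t) + "\n")

result = out.getvalue()
print(result)
```

Concatenating a NumPy float with a string directly, as in the failing version, is what triggers the ufunc 'add' TypeError.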

Resources