I am new to Python and have tried to accomplish the following, with only limited success:
In a folder there are *.columns files that each contain 5 columns (0-4) and 500 rows. I need to sum columns 1-4 over all *.columns files and plot the result against the first column of any one of them (the first columns are all equal).
I created an empty array into which I want to paste the first column (0) of the array "x_array3" and columns 1-4 of "y_array0". All of them have the same shape (500L, 5L).
Could you please advise me on how to proceed? I am lost right now.
Christian
import glob
import numpy as np
ListOfFiles = glob.glob("*.columns")
y_array0 = 0
for filename in ListOfFiles:
    y_array1 = np.genfromtxt(filename, skip_header = 1, usecols = (0, 1, 2, 3, 4))
    y_array0 = y_array0 + y_array1
x_array3 = np.genfromtxt(ListOfFiles[0], skip_header = 1, usecols = (0, 1, 2, 3, 4))
empty_array = np.empty(shape=(500, 5))
ausgabe_array = ??? here I'm stuck ???
np.savetxt('SX_DOS.out', ausgabe_array)
I have found a working solution. I read in all the columns as single arrays and merge them at the end. Still, can anybody give me a hint how one populates an empty array with selected items from another array (with another size)?
Chr.
import glob
import numpy as np
ListOfFiles = glob.glob("*.columns")
y_array_s0 = 0
y_array_p0 = 0
y_array_d0 = 0
y_array_f0 = 0
for filename in ListOfFiles:
    y_array_s1 = np.genfromtxt(filename, skip_header = 1, usecols = (1))
    y_array_s0 = y_array_s0 + y_array_s1
    y_array_p1 = np.genfromtxt(filename, skip_header = 1, usecols = (2))
    y_array_p0 = y_array_p0 + y_array_p1
    y_array_d1 = np.genfromtxt(filename, skip_header = 1, usecols = (3))
    y_array_d0 = y_array_d0 + y_array_d1
    y_array_f1 = np.genfromtxt(filename, skip_header = 1, usecols = (4))
    y_array_f0 = y_array_f0 + y_array_f1
x_array3 = np.genfromtxt(ListOfFiles[0], skip_header = 1, usecols = (0))
ausgabe_array = np.transpose(np.array((x_array3, y_array_s0, y_array_p0, y_array_d0,y_array_f0)))
np.savetxt('SX_DOS.out', ausgabe_array)
I think you are trying to sum columns in an imported array.
Assuming this is working:
y_array1 = np.genfromtxt(filename, skip_header = 1, usecols = (0, 1, 2, 3, 4))
then
y_array0 = y_array1.sum(axis=0)
should give the sum of each column (one value per column, summed over the rows).
With regard to plotting, I would recommend matplotlib.
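The follow-up question, how to populate an empty array with selected columns from another array, can be sketched as follows. This is a minimal illustration, not the asker's actual data: `x_array3` and `y_array0` are filled with made-up values here instead of being read with `np.genfromtxt`, and `ausgabe_alt` is a hypothetical name for the one-step variant.

```python
import numpy as np

# Hypothetical stand-ins for the arrays read by np.genfromtxt: 500 rows, 5 columns.
rows = 500
x_array3 = np.column_stack([np.linspace(0.0, 1.0, rows)] + [np.ones(rows)] * 4)
y_array0 = 3.0 * x_array3  # pretend this is the running sum over three files

# Option 1: allocate the output, then fill it by column slices.
ausgabe_array = np.empty((rows, 5))
ausgabe_array[:, 0] = x_array3[:, 0]    # x axis taken from a single file
ausgabe_array[:, 1:] = y_array0[:, 1:]  # summed columns 1-4

# Option 2: build the same array in one step without pre-allocating.
ausgabe_alt = np.column_stack((x_array3[:, 0], y_array0[:, 1:]))
```

Slice assignment works whenever the shapes on both sides match (or broadcast), so the source array may be larger than the part that is copied.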
I'm using MONAI in Spyder (Anaconda) to build a U-Net network. I want to add/modify layers starting from this baseline.
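# Imports assumed from the names used below
import torch
from monai import losses, utils
from monai.inferers import SimpleInferer
from monai.networks import layers, nets
from monai.transforms import Activations, AsDiscrete, Compose, EnsureType
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nets.UNet(
    spatial_dims = 2,
    in_channels = 3,
    out_channels = 1,
    channels = (4, 8, 16, 32, 64),
    strides = (2, 2, 2, 2),
    num_res_units = 3,
    norm = layers.Norm.BATCH,
    kernel_size=3,).to(device)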
loss_function = losses.DiceLoss()
torch.backends.cudnn.benchmark = True
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4, weight_decay = 0)
post_pred = Compose([EnsureType(), Activations(sigmoid = True), AsDiscrete(threshold=0.5)])
post_label = Compose([EnsureType()])
inferer = SimpleInferer()
utils.set_determinism(seed=46)
My final aim is to create a MultiResUNet that has different layers such as:
class Conv2d_batchnorm(torch.nn.Module):
    '''
    2D Convolutional layers
    Arguments:
        num_in_filters {int} -- number of input filters
        num_out_filters {int} -- number of output filters
        kernel_size {tuple} -- size of the convolving kernel
        stride {tuple} -- stride of the convolution (default: {(1, 1)})
        activation {str} -- activation function (default: {'relu'})
    '''
    def __init__(self, num_in_filters, num_out_filters, kernel_size, stride = (1,1), activation = 'relu'):
        super().__init__()
        self.activation = activation
        self.conv1 = torch.nn.Conv2d(in_channels=num_in_filters, out_channels=num_out_filters, kernel_size=kernel_size, stride=stride, padding = 'same')
        self.batchnorm = torch.nn.BatchNorm2d(num_out_filters)
    def forward(self,x):
        x = self.conv1(x)
        x = self.batchnorm(x)
        if self.activation == 'relu':
            return torch.nn.functional.relu(x)
        else:
            return x
This is just an example of a different Conv2d layer that I would use instead of the native one of the baseline.
Hope some of you can figure out how to proceed.
Thanks, Fede
I'm trying to replace one column of a 4-column array with datetime values that I have processed. The problem is that it is difficult to keep the same layout across the different formats (DataFrame, array, ...).
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:,0,0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
time = ds.variables["time"][:]
time = pd.to_datetime(ds.variables["time"][:],origin=pd.Timestamp('1850-01-01'),unit='D')
#np.datetime64(ds.variables["time"][:],'D')
x2 = pd.DataFrame(np.zeros((len(dataw),4), float))
x = np.zeros((len(dataw),4), float)
x[:,0] = time
x[:,1] = long
x[:,2] = lat[:]
x[:,3] = dataw[:]*86400
x=pd.DataFrame(x)
x[:,0] = pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
If I put the transformed dates directly into the array, the result looks like: 1.32542e+18
I tried
time = ds.variables["time"][:]
including it in the array, and then using
x[:,0]=pd.to_datetime(x[:,0],origin=pd.Timestamp('1850-01-01'),unit='D')
I get the error:
TypeError: unhashable type: 'slice'
I also tried assigning the converted dates directly:
time=pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
x[:,0] = time[:]
TypeError: unhashable type: 'slice'
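Try this instead. A NumPy array has a single dtype, so it cannot hold datetimes in one column and floats in the others; build the numeric columns first and add the dates as a DataFrame column.
import numpy as np
import pandas as pd
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:, 0, 0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
# Convert the time values (days since 1850-01-01, as in the question) to timestamps
time = pd.to_datetime(np.asarray(ds.variables["time"][:]), origin=pd.Timestamp('1850-01-01'), unit='D')
# Numeric columns only: one dtype for the whole array
x = np.zeros((len(dataw), 3), dtype=float)
x[:, 0] = long
x[:, 1] = lat
x[:, 2] = dataw * 86400
df = pd.DataFrame(x, columns=["Longitude", "Latitude", "Data"])
# Insert the datetime column at position 0
df.insert(0, "Time", time)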
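The underlying idea can be tried without a netCDF file. The sketch below uses made-up values: `days`, `pr`, `lon`, and `lat` are hypothetical stand-ins for the variables read from `ds`. It shows that a DataFrame can keep a datetime column next to float columns, and that positional access on a DataFrame goes through `.iloc` rather than `x[:, 0]`, which is what raises the "unhashable type: 'slice'" error in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the netCDF variables
days = np.array([0.0, 1.0, 2.0])   # days since 1850-01-01
pr = np.array([1e-5, 2e-5, 3e-5])  # made-up precipitation values
lon, lat = 10.5, 45.0              # a single grid point

time = pd.to_datetime(days, origin=pd.Timestamp("1850-01-01"), unit="D")

# Each DataFrame column keeps its own dtype; scalars are repeated for every row.
df = pd.DataFrame({
    "Time": time,
    "Longitude": lon,
    "Latitude": lat,
    "Data": pr * 86400,
})

# Positional access on a DataFrame uses .iloc, not NumPy-style df[:, 0].
first_column = df.iloc[:, 0]
```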
Xarray makes the netCDF -> pandas workflow quite straightforward:
import xarray as xr
ds = xr.open_dataset('file.nc', engine='netcdf4')
df = ds.to_dataframe()
Presuming your time variable follows the CF conventions, Xarray will automatically decode it into datetime objects.
I am trying to identify global feature relationships with SHAP values. The SHAP library returns three matrices and I am trying to select the SHAP values matrix; however, I am getting this error: "IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed".
The code I have is below:
df_score = spark.sql("select * from sandbox.yt_trng_churn_device")
#XGBoost Model
import datetime
import pickle
import pandas as pd
from xgboost import XGBClassifier
from mlflow.tracking import MlflowClient
client = MlflowClient()
local_dir = "/dbfs/FileStore/"
local_path = client.download_artifacts
model_path = '/dbfs/FileStore/'
model = XGBClassifier()
model = pickle.load(open(model_path, 'rb'))
HorizonDate = datetime.datetime(2022, 9, 5)
df = df_score
score_data = df.toPandas()
results = model.predict_proba(score_data)
results_l = model.predict(score_data)
score_data["p"]=pd.Series( (v[1] for v in results) )
score_data["l"]=pd.Series( (v for v in results_l) )
spark.createDataFrame(score_data).createOrReplaceTempView("yt_vw_tmp_dev__scores")
spark.sql("create or replace table sandbox.yt_vw_tmp_dev__scores as select * from yt_vw_tmp_dev__scores")
#SHAP Analysis on XGBoost
from shap import KernelExplainer, summary_plot
sql = """
select d_a.*
from
hive_metastore.sandbox.yt_trng_device d_a
right join
(select decile, msisdn, MSISDN_L2L
from(
select ntile(10) over (order by p desc) as decile, msisdn, MSISDN_L2L
from sandbox.yt_vw_tmp_dev__scores
) inc
order by decile) d_b
on d_a.MSISDN_L2L = d_b.MSISDN_L2L and d_a.msisdn = d_b.msisdn
"""
df = spark.sql(sql).drop('msisdn', 'imei', 'imsi', 'event_date', 'MSISDN_L2L', 'account_id')
score_df = df.toPandas()
mode = score_df.mode().iloc[0]
sample = score_df.sample(n=min(100, score_df.shape[0]), random_state=508502835).fillna(mode)
predict = lambda x: model.predict(pd.DataFrame(x, columns=score_df.columns))
explainer = KernelExplainer(predict, sample, link="identity")
shap_values = explainer.shap_values(sample, l1_reg=False)
# The return of the explainer has three matrices, we will get the shap values one
shap_values = shap_values[ :, :, 0]
I am fairly new to coding, but it would be great if someone could give some direction on this.
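One possible reading of the error, stated as an assumption rather than a confirmed diagnosis: when the function passed to `KernelExplainer` returns one value per row (as `model.predict` does here), `shap_values` tends to come back as a single 2-D array of shape (n_samples, n_features), so there is no third axis to index. Depending on the SHAP version and the number of model outputs, the result can also be a list of 2-D arrays or a 3-D array. The helper below, `select_shap_matrix`, is a hypothetical name and not part of SHAP; it only uses NumPy to pick a 2-D matrix in each of those cases.

```python
import numpy as np

def select_shap_matrix(shap_values, output_index=0):
    """Return a 2-D (n_samples, n_features) matrix from a SHAP result.

    Hypothetical helper: handles a list of 2-D arrays, a single 2-D array,
    or a 3-D array with the model outputs on the last axis.
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[output_index])
    arr = np.asarray(shap_values)
    if arr.ndim == 2:
        return arr  # already the matrix; indexing a third axis would fail
    if arr.ndim == 3:
        return arr[:, :, output_index]
    raise ValueError(f"unexpected SHAP result with {arr.ndim} dimensions")

# Made-up arrays standing in for explainer output (100 samples, 7 features)
single_output = np.zeros((100, 7))
multi_output = np.zeros((100, 7, 2))
matrix_a = select_shap_matrix(single_output)
matrix_b = select_shap_matrix(multi_output, output_index=1)
```

Printing `np.shape(shap_values)` right after the `explainer.shap_values(...)` call is a quick way to see which of these cases applies.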
I have the following iteration (a for loop) over every row, depending on the indicators 'H' and 'G' in df1. It creates a new column with the product of the selected indicator values. Now I would like this to be done automatically for all indicators (in case I have more than 'H' and 'G'). Unfortunately, I am struggling to put it in a dictionary.
Can anyone help with this? Thank you and have an excellent week.
df1 =pd.DataFrame({'Country':['Armenia','Azerbaidjan','Belarus','Armenia','Azerbaidjan','Belarus'],\
'Indictaor':['G','G','G','H', 'H', 'H'],'2005':[3,4,5,6,7,4],'2006':[6,3,1,3,5,6]})
df2 = pd.DataFrame({'Year':[2005,2006,2005,2006],
'Country1':['Armenia','Armenia','Azerbaidjan','Azerbaidjan'],
'Country2': ['Belarus','Belarus','Belarus','Belarus']})
df3 = pd.DataFrame({'Year':[2005,2006,2005,2006],
'Country2': ['Belarus','Belarus','Belarus','Belarus'],
'Country1':['Armenia','Armenia','Azerbaidjan','Azerbaidjan'],
'IndictaorGProduct':[15,6,35,5],
'IndictaorHProduct':[24,18,28,30]})
# df4 holds the country pairs; the column order matters for the positional lookups below
df4 = df2[['Year', 'Country2', 'Country1']].copy()
gprod = []
hprod = []
for row in df4.iterrows():
    c1 = row[1][2]
    c2 = row[1][1]
    yr = str(row[1][0])
    g1 = df1.loc[(df1['Country']==c1)&(df1['Indictaor']=='G')]
    g1val = g1[yr].values[0]
    g2 = df1.loc[(df1['Country']==c2)&(df1['Indictaor']=='G')]
    g2val = g2[yr].values[0]
    print(g1val, g2val, g1val*g2val)
    gprod.append(g1val*g2val)
df4['GProduct'] = gprod
for row in df4.iterrows():
    c1 = row[1][2]
    c2 = row[1][1]
    yr = str(row[1][0])
    g1 = df1.loc[(df1['Country']==c1)&(df1['Indictaor']=='H')]
    g1val = g1[yr].values[0]
    g2 = df1.loc[(df1['Country']==c2)&(df1['Indictaor']=='H')]
    g2val = g2[yr].values[0]
    print(g1val, g2val, g1val*g2val)
    hprod.append(g1val*g2val)
df4['HProduct'] = hprod
It depends where you get the Indicators from. Do you decide on them or do you get them from the column?
In case you get them from the respective column you could use the column to get a list with unique values from the column. Then you can loop over the values in a second loop. But note that, depending on your data size, this might not be very efficient.
However here is what you could do:
import pandas as pd
df1 = pd.DataFrame({'Country': ['Armenia', 'Azerbaidjan', 'Belarus', 'Armenia', 'Azerbaidjan', 'Belarus'], \
'Indictaor': ['G', 'G', 'G', 'H', 'H', 'H'], '2005': [3, 4, 5, 6, 7, 4],
'2006': [6, 3, 1, 3, 5, 6]})
df2 = pd.DataFrame({'Year': [2005, 2006, 2005, 2006],
'Country1': ['Armenia', 'Armenia', 'Azerbaidjan', 'Azerbaidjan'],
'Country2': ['Belarus', 'Belarus', 'Belarus', 'Belarus']})
df3 = pd.DataFrame({'Year': [2005, 2006, 2005, 2006],
'Country2': ['Belarus', 'Belarus', 'Belarus', 'Belarus'],
'Country1': ['Armenia', 'Armenia', 'Azerbaidjan', 'Azerbaidjan'],
'IndictaorGProduct': [15, 6, 35, 5],
'IndictaorHProduct': [24, 18, 28, 30]})
cols = ['Year', 'Country2', 'Country1']
df4 = pd.DataFrame(columns=cols)
df4['Year'] = df2['Year']
df4['Country1'] = df2['Country1']
df4['Country2'] = df2['Country2']
# Get all the unique indicators from the indicator column; you could also
# keep a manual list of the indicators you want to loop over
indicators = df1["Indictaor"].unique()
for i in indicators:
    prod = []
    for row in df4.iterrows():
        c1 = row[1][2]
        c2 = row[1][1]
        yr = str(row[1][0])
        g1 = df1.loc[(df1['Country'] == c1) & (df1['Indictaor'] == i)]  # compare to the indicator in the list
        g1val = g1[yr].values[0]
        g2 = df1.loc[(df1['Country'] == c2) & (df1['Indictaor'] == i)]
        g2val = g2[yr].values[0]
        print(g1val, g2val, g1val * g2val)
        prod.append(g1val * g2val)
    colname = "".join([i, "Product"])
    df4[colname] = prod
print("Done")
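Since the nested loop may get slow on larger data, here is a hedged sketch of a loop-free variant using the same sample frames. It is only one possible approach: it reshapes df1 so that each (Country, Year) pair is a row and each indicator is a column, then looks up both countries at once. The names `long`, `wide`, `products`, and `result` are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Country": ["Armenia", "Azerbaidjan", "Belarus"] * 2,
    "Indictaor": ["G", "G", "G", "H", "H", "H"],
    "2005": [3, 4, 5, 6, 7, 4],
    "2006": [6, 3, 1, 3, 5, 6],
})
df2 = pd.DataFrame({
    "Year": [2005, 2006, 2005, 2006],
    "Country1": ["Armenia", "Armenia", "Azerbaidjan", "Azerbaidjan"],
    "Country2": ["Belarus"] * 4,
})

# Long format: one row per (Country, Indictaor, Year)
long = df1.melt(id_vars=["Country", "Indictaor"], var_name="Year", value_name="value")
long["Year"] = long["Year"].astype(int)

# Wide by indicator: rows are (Country, Year), one column per indicator
wide = long.pivot(index=["Country", "Year"], columns="Indictaor", values="value")

# Look up the indicator values of both countries for every row of df2
v1 = wide.reindex(pd.MultiIndex.from_frame(df2[["Country1", "Year"]])).to_numpy()
v2 = wide.reindex(pd.MultiIndex.from_frame(df2[["Country2", "Year"]])).to_numpy()

# Element-wise products, one new column per indicator
products = pd.DataFrame(v1 * v2, index=df2.index,
                        columns=[f"{c}Product" for c in wide.columns])
result = pd.concat([df2, products], axis=1)
```

Any additional indicator in df1 becomes an extra column of `wide` and therefore an extra product column, without changing the code.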
I'm trying to do a churn analysis with R and SQL Server 2016.
I have uploaded my dataset to a database on a local SQL Server instance and done all the preliminary work on it.
I now have this function, trainModel(), which I would like to use to fit my random forest model:
trainModel = function(sqlSettings, trainTable) {
  sqlConnString = sqlSettings$connString
  trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                  table = trainTable,
                                  colInfo = cdrColInfo)
  ## Create training formula
  labelVar = "churn"
  trainVars <- rxGetVarNames(trainDataSQL)
  trainVars <- trainVars[!trainVars %in% c(labelVar)]
  temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
  formula <- as.formula(temp)
  ## Train a decision forest with rxDForest on the SQL data source
  library(RevoScaleR)
  rx_forest_model <- rxDForest(formula = formula,
                               data = trainDataSQL,
                               nTree = 8,
                               maxDepth = 16,
                               mTry = 2,
                               minBucket = 1,
                               replace = TRUE,
                               importance = TRUE,
                               seed = 8,
                               parms = list(loss = c(0, 4, 1, 0)))
  return(rx_forest_model)
}
But when I run the function, I get this unexpected output:
> system.time({
+ trainModel(sqlSettings, trainTable)
+ })
user system elapsed
0.29 0.07 58.18
Warning message:
In tempGetNumObs(numObs) :
Number of observations not available for this data source. 'numObs' set to 1e6.
And along with this warning message, the function trainModel() does not create the object rx_forest_model.
Does anyone have any suggestions on how to solve this problem?
After several attempts, I found the reason why the function trainModel() did not work properly.
It is not a connection string problem, and it is not a data source type issue either.
The problem is in the syntax of the function trainModel().
It is enough to remove the following statement from the body of the function:
return(rx_forest_model)
With this change, the function still returns the same warning message, but it creates the object rx_forest_model correctly.
So, the correct function is:
trainModel = function(sqlSettings, trainTable) {
  sqlConnString = sqlSettings$connString
  trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                  table = trainTable,
                                  colInfo = cdrColInfo)
  ## Create training formula
  labelVar = "churn"
  trainVars <- rxGetVarNames(trainDataSQL)
  trainVars <- trainVars[!trainVars %in% c(labelVar)]
  temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
  formula <- as.formula(temp)
  ## Train a decision forest with rxDForest on the SQL data source
  library(RevoScaleR)
  rx_forest_model <- rxDForest(formula = formula,
                               data = trainDataSQL,
                               nTree = 8,
                               maxDepth = 16,
                               mTry = 2,
                               minBucket = 1,
                               replace = TRUE,
                               importance = TRUE,
                               seed = 8,
                               parms = list(loss = c(0, 4, 1, 0)))
}