R: how to properly create rx_forest_model object? - sql-server

I'm trying to do a churn analysis with R and SQL Server 2016.
I have uploaded my dataset on my database in a local SQL Server and I did all the preliminary work on this dataset.
Well, now I have this function trainModel(), which I would use to estimate my random forest model:
trainModel = function(sqlSettings, trainTable) {
    sqlConnString = sqlSettings$connString
    trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                    table = trainTable,
                                    colInfo = cdrColInfo)
    ## Create training formula, e.g. "churn~var1+var2+..."
    labelVar = "churn"
    trainVars <- rxGetVarNames(trainDataSQL)
    trainVars <- trainVars[!trainVars %in% c(labelVar)]
    temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
    formula <- as.formula(temp)
    ## Train a random forest with rxDForest on the SQL data source
    library(RevoScaleR)
    rx_forest_model <- rxDForest(formula = formula,
                                 data = trainDataSQL,
                                 nTree = 8,
                                 maxDepth = 16,
                                 mTry = 2,
                                 minBucket = 1,
                                 replace = TRUE,
                                 importance = TRUE,
                                 seed = 8,
                                 parms = list(loss = c(0, 4, 1, 0)))
    return(rx_forest_model)
}
But when I run the function, I get this unexpected output:
> system.time({
+ trainModel(sqlSettings, trainTable)
+ })
   user  system elapsed
   0.29    0.07   58.18
Warning message:
In tempGetNumObs(numObs) :
Number of observations not available for this data source. 'numObs' set to 1e6.
Along with this warning message, the function trainModel() does not create the rx_forest_model object.
Does anyone have any suggestions on how to solve this problem?

After several attempts, I found the reason why the function trainModel() did not work properly.
It is not a connection-string problem, nor an issue with the data source type.
The problem is in the syntax of the function trainModel().
It is enough to remove this statement from the body of the function:
return(rx_forest_model)
In this way the function returns the same warning message, but it creates the rx_forest_model object correctly.
So, the correct function is:
trainModel = function(sqlSettings, trainTable) {
    sqlConnString = sqlSettings$connString
    trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                    table = trainTable,
                                    colInfo = cdrColInfo)
    ## Create training formula, e.g. "churn~var1+var2+..."
    labelVar = "churn"
    trainVars <- rxGetVarNames(trainDataSQL)
    trainVars <- trainVars[!trainVars %in% c(labelVar)]
    temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
    formula <- as.formula(temp)
    ## Train a random forest with rxDForest on the SQL data source
    library(RevoScaleR)
    rx_forest_model <- rxDForest(formula = formula,
                                 data = trainDataSQL,
                                 nTree = 8,
                                 maxDepth = 16,
                                 mTry = 2,
                                 minBucket = 1,
                                 replace = TRUE,
                                 importance = TRUE,
                                 seed = 8,
                                 parms = list(loss = c(0, 4, 1, 0)))
}
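For completeness, a minimal sketch of calling the function and capturing the model at the call site (assuming sqlSettings and trainTable are set up as in the question; the assignment is what binds the returned model to a name in the calling environment):
system.time({
    rx_forest_model <- trainModel(sqlSettings, trainTable)
})
print(rx_forest_model)      # forest summary
rx_forest_model$importance  # variable importance, since importance = TRUE above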

Related

How to create my own layers on MONAI U-Net?

I'm using MONAI (on Spyder, Anaconda) to build a U-Net. I want to add/modify layers starting from this baseline.
# imports assumed for this snippet
import torch
import monai.losses as losses
import monai.utils as utils
from monai.networks import nets, layers
from monai.inferers import SimpleInferer
from monai.transforms import Compose, EnsureType, Activations, AsDiscrete

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nets.UNet(
    spatial_dims = 2,
    in_channels = 3,
    out_channels = 1,
    channels = (4, 8, 16, 32, 64),
    strides = (2, 2, 2, 2),
    num_res_units = 3,
    norm = layers.Norm.BATCH,
    kernel_size = 3,
).to(device)
loss_function = losses.DiceLoss()
torch.backends.cudnn.benchmark = True
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4, weight_decay = 0)
post_pred = Compose([EnsureType(), Activations(sigmoid = True), AsDiscrete(threshold = 0.5)])
post_label = Compose([EnsureType()])
inferer = SimpleInferer()
utils.set_determinism(seed = 46)
My final aim is to create a MultiResUNet that has different layers such as:
class Conv2d_batchnorm(torch.nn.Module):
    '''
    2D convolutional layer followed by batch normalization.
    Arguments:
        num_in_filters {int} -- number of input filters
        num_out_filters {int} -- number of output filters
        kernel_size {tuple} -- size of the convolving kernel
        stride {tuple} -- stride of the convolution (default: {(1, 1)})
        activation {str} -- activation function (default: {'relu'})
    '''
    def __init__(self, num_in_filters, num_out_filters, kernel_size, stride = (1, 1), activation = 'relu'):
        super().__init__()
        self.activation = activation
        self.conv1 = torch.nn.Conv2d(in_channels = num_in_filters, out_channels = num_out_filters,
                                     kernel_size = kernel_size, stride = stride, padding = 'same')
        self.batchnorm = torch.nn.BatchNorm2d(num_out_filters)

    def forward(self, x):
        x = self.conv1(x)
        x = self.batchnorm(x)
        if self.activation == 'relu':
            return torch.nn.functional.relu(x)
        else:
            return x
This is just an example of a different Conv2d layer that I would use instead of the native one of the baseline.
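One idea I've sketched (untested, and assuming the Conv2d_batchnorm class and the model/device defined above) is to walk the instantiated U-Net and swap the custom block in. Note that torch's padding = 'same' only supports unit stride, so the strided downsampling convolutions are skipped here:
def replace_convs(module):
    # recursively replace unit-stride Conv2d layers with Conv2d_batchnorm;
    # strided convs are left untouched because padding='same' requires stride 1
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Conv2d) and child.stride == (1, 1):
            setattr(module, name, Conv2d_batchnorm(child.in_channels,
                                                   child.out_channels,
                                                   child.kernel_size,
                                                   stride = child.stride))
        else:
            replace_convs(child)

replace_convs(model)
model.to(device)  # move any newly created layers onto the same device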
Hope some of you can figure out how to proceed.
Thanks, Fede

Python : Replace a column in a dataframe by datetime values

I'm trying to replace one column of a 4-column array with datetime values that I've processed. The problem is that it's difficult to keep a consistent shape across the different formats (DataFrame, array, ...).
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:,0,0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
time = ds.variables["time"][:]
time = pd.to_datetime(ds.variables["time"][:],origin=pd.Timestamp('1850-01-01'),unit='D')
#np.datetime64(ds.variables["time"][:],'D')
x2 = pd.DataFrame(np.zeros((len(dataw),4), float))
x = np.zeros((len(dataw),4), float)
x[:,0] = time
x[:,1] = long
x[:,2] = lat[:]
x[:,3] = dataw[:]*86400
x=pd.DataFrame(x)
x[:,0] = pd.to_datetime(time,origin=pd.Timestamp('1850-01-01'),unit='D')
If I put the transformed dates directly into the array, the result looks like: 1.32542e+18
I tried
time = ds.variables["time"][:]
and included it in the array, then used
x[:,0] = pd.to_datetime(x[:,0], origin=pd.Timestamp('1850-01-01'), unit='D')
but I get the error:
TypeError: unhashable type: 'slice'
I also tried directly:
time = pd.to_datetime(time, origin=pd.Timestamp('1850-01-01'), unit='D')
x[:,0] = time[:]
TypeError: unhashable type: 'slice'
Try this instead. The x[:, 0] assignments fail because a pandas DataFrame does not support NumPy-style positional slicing (hence TypeError: unhashable type: 'slice'; positional access would need x.iloc[:, 0]). Building the frame column by column also lets each column keep its own dtype, which is what the dates need:
import numpy as np
import pandas as pd

dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:, 0, 0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
# decode "days since 1850-01-01" directly to datetimes
time = pd.to_datetime(ds.variables["time"][:], origin=pd.Timestamp('1850-01-01'), unit='D')
# a single NumPy array forces every column to one dtype, which is what
# mangled the dates; a dict of columns preserves per-column dtypes
df = pd.DataFrame({
    "Time": time,
    "Longitude": long,
    "Latitude": lat,
    "Data": dataw * 86400,
})
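A quick sanity check that the dates survived the round trip (hedged, since the exact display depends on your pandas version):
print(df.dtypes)  # "Time" should report datetime64[ns]
print(df["Time"].min(), df["Time"].max())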
Xarray makes the netcdf -> pandas workflow quite straightforward:
import xarray as xr
ds = xr.open_dataset('file.nc', engine='netcdf4')
df = ds.to_dataframe()  # to_pandas() only handles datasets with <=1 dimension
Presuming your time variable follows CF conventions, Xarray will automatically decode it into datetime objects.
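If the goal is the same four-column layout as above, a hedged follow-up (assuming the variable names pr/lat/lon/time from the question; to_dataframe flattens the dimensions into a MultiIndex, which reset_index turns back into columns):
df = ds.to_dataframe().reset_index()
df = df[["time", "lon", "lat", "pr"]]
df["pr"] = df["pr"] * 86400  # same unit conversion as in the question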

Error input size in adaptive_avg_pool2d() MONAI

I'm using MONAI to build a U-Net for a segmentation task. I want to crop the input images with "CropForegroundd" as follows:
# imports assumed for this snippet
from monai.data import PILReader, IterableDataset, DataLoader
from monai.transforms import (LoadImaged, EnsureTyped, AddChanneld, RepeatChanneld,
                              AsChannelFirstd, AsDiscreted, CropForegroundd, Resized,
                              ToTensord, Compose)

monai_load = [
    LoadImaged(keys=["image", "segmentation"], image_only=False, reader=PILReader()),
    EnsureTyped(keys=["image", "segmentation"], data_type="numpy"),
    AddChanneld(keys=["segmentation", "image"]),
    RepeatChanneld(keys=["image"], repeats=3),
    AsChannelFirstd(keys=["image"], channel_dim=0),
]
monai_transforms = [
    AsDiscreted(keys=["segmentation"], threshold=0.5),
    CropForegroundd(keys=["image", "segmentation"], source_key="image", margin=20),
    Resized(keys=["image", "segmentation"], spatial_size=(240, 240)),
    ToTensord(keys=["image", "segmentation"]),
]
train_transforms = Compose(monai_load + monai_transforms)
train_ds = IterableDataset(data=train_data, transform=train_transforms)
train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, num_workers=0, pin_memory=True)
While training is running, it stops at a certain step and gives me this error:
RuntimeError: adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, but input has sizes [1, 3, 0, 0] with dimension 2 being empty
I don't think the problem is the CropForegroundd margin parameter, but I'm not able to solve this error.
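To narrow down which sample collapses to zero size, a shape probe can be spliced in right after the crop (a hedged sketch using MONAI's Lambdad; it only prints and passes the data through unchanged):
from monai.transforms import Lambdad

def log_shape(x):
    # CropForegroundd uses select_fn=lambda x: x > 0 by default, so an image
    # with no positive pixels crops down to a zero-sized spatial extent
    print("shape after CropForegroundd:", x.shape)
    return x

debug_transforms = monai_transforms[:2] + [Lambdad(keys=["image"], func=log_shape)] + monai_transforms[2:]
train_transforms = Compose(monai_load + debug_transforms)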
Thank you in advance for getting back to me.
Federico

IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed

I am trying to identify global feature relationships with SHAP values. The SHAP library returns three matrices, and I am trying to select the SHAP values matrix; however, I am getting this error: "IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed".
The code I have is below:
df_score = spark.sql("select * from sandbox.yt_trng_churn_device")
#XGBoost Model
import pickle
from xgboost import XGBClassifier
from mlflow.tracking import MlflowClient
client = MlflowClient()
local_dir = "/dbfs/FileStore/"
local_path = client.download_artifacts
model_path = '/dbfs/FileStore/'
model = XGBClassifier()
model = pickle.load(open(model_path, 'rb'))
HorizonDate = datetime.datetime(2022, 9, 5)
df = df_score
score_data = df.toPandas()
results = model.predict_proba(score_data)
results_l = model.predict(score_data)
score_data["p"]=pd.Series( (v[1] for v in results) )
score_data["l"]=pd.Series( (v for v in results_l) )
spark.createDataFrame(score_data).createOrReplaceTempView("yt_vw_tmp_dev__scores")
spark.sql("create or replace table sandbox.yt_vw_tmp_dev__scores as select * from yt_vw_tmp_dev__scores")
#SHAP Analysis on XGBoost
from shap import KernelExplainer, summary_plot
sql = """
select d_a.*
from
hive_metastore.sandbox.yt_trng_device d_a
right join
(select decile, msisdn, MSISDN_L2L
from(
select ntile(10) over (order by p desc) as decile, msisdn, MSISDN_L2L
from sandbox.yt_vw_tmp_dev__scores
) inc
order by decile) d_b
on d_a.MSISDN_L2L = d_b.MSISDN_L2L and d_a.msisdn = d_b.msisdn
"""
df = spark.sql(sql).drop('msisdn', 'imei', 'imsi', 'event_date', 'MSISDN_L2L', 'account_id')
score_df = df.toPandas()
mode = score_df.mode().iloc[0]
sample = score_df.sample(n=min(100, score_df.shape[0]), random_state=508502835).fillna(mode)
predict = lambda x: model.predict(pd.DataFrame(x, columns=score_df.columns))
explainer = KernelExplainer(predict, sample, link="identity")
shap_values = explainer.shap_values(sample, l1_reg=False)
# The return of the explainer has three matrices, we will get the shap values one
shap_values = shap_values[ :, :, 0]
I am fairly new to coding, but it would be great if someone could give some direction on this.
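One direction worth checking (a hedged sketch, since SHAP's return shape depends on the wrapped function and the SHAP version): model.predict returns a single output per row, so KernelExplainer hands back one 2-D matrix of shape (n_samples, n_features) and there is no third axis to index. Per-class matrices only appear when the wrapped function is multi-output, e.g. predict_proba:
# with the single-output predict above, shap_values is already the
# (samples x features) matrix, so use it directly
shap_matrix = explainer.shap_values(sample, l1_reg=False)
summary_plot(shap_matrix, sample)

# hypothetical alternative: wrap predict_proba to get per-class values;
# depending on the SHAP version this is a list of matrices or a 3-D array
predict_proba = lambda x: model.predict_proba(pd.DataFrame(x, columns=score_df.columns))
explainer_p = KernelExplainer(predict_proba, sample, link="identity")
shap_values = explainer_p.shap_values(sample, l1_reg=False)
class_one = shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1]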

Map Layer Issues in ggplot2

I'm having a few issues with finalizing my map for a report. I think I'm warm on the solutions, but haven't quite figured them out. I would really appreciate any help on solutions so that I can finally move on!
1) The scale bar will NOT populate in the MainMap code, and consequently in the Figure1 plot. This is "remedied" in the MainMap code if I comment out the BCWA_land map layer. However, when I retain the BCWA_land layer, the scale bar is eliminated and this error is produced:
Warning message: Removed 3 rows containing missing values (geom_text).
And this is the code:
MainMap <- ggplot(QOI) +
  geom_sf(aes(fill = quadID)) +
  scale_fill_manual(values = c("#6b8c42",
                               "#70b2ae",
                               "#d65a31")) +
  labs(fill = "Quadrants of Interest",
       caption = "Figure 1: Map depicting the quadrants in the WSDOT project area as well as other quadrants of interest in the Puget Sound area.") +
  ggtitle("WSDOT Project Area and Quadrants of Interest") +
  scalebar(x.min = -123, x.max = -122.8, y.min = 47, y.max = 47.1, location = "bottomleft",
           transform = TRUE, dist = 10, dist_unit = "km", st.size = 3, st.bottom = TRUE, st.dist = 0.1) +
  north(data = QOI, location = "topleft", scale = 0.1, symbol = 12, x.min = -123, y.min = 48.3, x.max = -122.7, y.max = 48.4) +
  theme_bw() +
  theme(panel.grid = element_line(color = "gray50"),
        panel.background = element_blank(),
        panel.ontop = TRUE,
        legend.text = element_text(size = 11, margin = margin(l = 3), hjust = 0),
        legend.position = c(0.95, 0.1),
        legend.justification = c(0.85, 0.1),
        legend.background = element_rect(colour = "#3c4245", fill = "#f4f4f3"),
        axis.title = element_blank(),
        plot.title = element_text(face = "bold", colour = "#3c4245", hjust = 0.5, margin = margin(b = 10, unit = "pt")),
        plot.caption = element_text(face = "italic", colour = "#3c4245", margin = margin(t = 7), hjust = 0, vjust = 0.5)) +
  geom_sf(data = BCWA_land) + # this is what I've tried commenting out to isolate the scale bar problem
  xlim(-123.1, -121.4) +
  ylim(47.0, 48.45)
MainMap
InsetRect <- data.frame(xmin = -123.2, xmax = -122.1, ymin = 47.02, ymax = 48.45)
InsetMap <- ggplotGrob(ggplot(quads) +
  geom_sf(aes(fill = "")) +
  scale_fill_manual(values = c("#eefbfb")) +
  geom_sf(data = BCWA_land) +
  scale_x_continuous(expand = c(0, 0), limits = c(-124.5, -122.0)) +
  scale_y_continuous(expand = c(0, 0), limits = c(47.0, 49.5)) +
  geom_rect(data = InsetRect, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            color = "#3c4245",
            size = 1.25,
            fill = NA,
            inherit.aes = FALSE) +
  theme_bw() +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.margin = margin(0, 0, 0, 0)))
InsetMap
Figure1 <- MainMap +
annotation_custom(grob = InsetMap, xmin = -122.2, xmax = -121.3,
ymin = 47.75, ymax = 48.5)
Figure1
As you can see, I'm not getting this issue or error for my north arrow, so I'm not really sure what is happening with the scale bar!
2) This problem is probably a little too OCD; however, I REALLY don't want the gridlines to show up on the InsetMap, and I was hoping the InsetMap would overlay on top of the MainMap without gridlines, since I set those parameters to element_blank() in the InsetMap code.
Here is an image of my plot. If you would like the data for this, please let me know. Because these are shapefiles, the data is unwieldy and not super conducive to SO's character limit for a post...
If anyone has any insight into a solution(s) I would so so appreciate that!! Thanks for your time!
The issue was the
xlim(-123.1, -121.4) +
ylim(47.0, 48.45)
call that I made. Instead, I used coord_sf(xlim = c(min, max), ylim = c(min, max)). I thought this would be helpful to someone who might be in my position later on!
Essentially, setting the limits with the x/y lim calls truncates the data available in your dataset, whereas coord_sf simply focuses the graph on that extent without altering the data you have available.
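Concretely, the fix amounts to dropping the xlim()/ylim() calls at the end of MainMap and closing with coord_sf() instead (a sketch using the same bounds as above):
MainMap <- ggplot(QOI) +
  geom_sf(aes(fill = quadID)) +
  # ... all of the layers above, unchanged ...
  geom_sf(data = BCWA_land) +
  coord_sf(xlim = c(-123.1, -121.4), ylim = c(47.0, 48.45))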
