How do i add blueprint into workflow_set in tidymodels

How do i add blueprint into workflow_set in tidymodels - sparse-matrix

I tried to follow the examples in the
Link 1 - Sparse Matrix
https://www.tidyverse.org/blog/2020/11/tidymodels-sparse-support/
Link 2 - Workflow_sets
https://www.tmwr.org/workflow-sets.html
I had trouble including the blue print into the workflow sets.
In the examples where workflow_set is defined in link 2
no_pre_proc <-
workflow_set(
preproc = list(simple = model_vars),
models = list(MARS = mars_spec, CART = cart_spec, CART_bagged = bag_cart_spec,
RF = rf_spec, boosting = xgb_spec, Cubist = cubist_spec)
)
and the way we add blue print into the workflow in link 1
wf_sparse <-
workflow() %>%
add_recipe(text_rec, blueprint = sparse_bp) %>%
add_model(lasso_spec)
wf_default <-
workflow() %>%
add_recipe(text_rec) %>%
add_model(lasso_spec)
Where and how do I add the "blueprint = sparse_bp" option in the workflow_set above?
My attempts were
no_pre_proc <-
workflow_set(
preproc = list(simple = model_vars),
models = list(MARS = mars_spec, CART = cart_spec, CART_bagged = bag_cart_spec,
RF = rf_spec, boosting = xgb_spec, Cubist = cubist_spec)) %>%
option_add(update_blueprint(blueprint = sparse_bp))
Running the racing tune gave me this error
Error: Problem with `mutate()` column `option`.
i `option = purrr::map(option, append_options, dots)`.
x All options should be named.
Run `rlang::last_error()` to see where the error occurred
<error/rlang_error>
There were 9 workflows that had no results.
Backtrace:
1. ggplot2::autoplot(...)
2. workflowsets:::autoplot.workflow_set(...)
3. workflowsets:::rank_plot(...)
4. workflowsets:::pick_metric(object, rank_metric, metric)
6. workflowsets:::collect_metrics.workflow_set(x)
7. workflowsets:::check_incompete(x, fail = TRUE)
8. workflowsets:::halt(msg)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
There were 9 workflows that had no results.
Backtrace:
x
1. +-ggplot2::autoplot(...)
2. \-workflowsets:::autoplot.workflow_set(...)
3. \-workflowsets:::rank_plot(...)
4. \-workflowsets:::pick_metric(object, rank_metric, metric)
5. +-tune::collect_metrics(x)
6. \-workflowsets:::collect_metrics.workflow_set(x)
7. \-workflowsets:::check_incompete(x, fail = TRUE)
8. \-workflowsets:::halt(msg)
>
thanks,

Thank you for asking this question; we definitely are not supporting this use case (passing non-default arguments to the recipe or model) very well right now. We've opened an issue here where you can track our work on this.
In the meantime, you could try a bit of a hacky workaround by manually using update_recipe() on the workflow you are interested in:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(parabolic)
set.seed(1)
split <- initial_split(parabolic)
train_set <- training(split)
test_set <- testing(split)
glmnet_spec <-
logistic_reg(penalty = 0.1, mixture = 0) %>%
set_engine("glmnet")
rec <-
recipe(class ~ ., data = train_set) %>%
step_YeoJohnson(all_numeric_predictors())
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix")
wfs_orig <-
workflow_set(
preproc = list(yj = rec,
norm = rec %>% step_normalize(all_numeric_predictors())),
models = list(regularized = glmnet_spec)
)
new_wf <-
wfs_orig %>%
extract_workflow("yj_regularized") %>%
update_recipe(rec, blueprint = sparse_bp)
Created on 2021-12-09 by the reprex package (v2.0.1)
Then (I know this feels hacky for now) manually take this new_wf and stick it in to the wfs_orig$info[[1]]$workflow slot to replace what is there.

Related

Lapply function to anova and post hoc test cld

I am new to r and I am trying to get my mind around the apply function. So far I managed to run my anovas for all the the variables on my data and I got the pairwise comparison.
varlist <- names(dt)[5:length(dt)]
# loop
models <- lapply(X = varlist,
FUN = function(t) lm(formula = paste0("`", t, "` ~ block+irrigation*genotype"), data = dt))
#Name the list of models to the column name
names(models) = varlist
## apply anova to each model stored in the list, models
lapply(models, anova)
#marginal-means-all-variable}
res.model1 <- lapply(models, function(x) pairs(emmeans(x, ~genotype:irrigation)))
res.model1
So far so good, now I want to create a compact letter list so I can use to plot it. Previously I used the following but I can't work out how to apply an lapply function to the following code
CLD = cld(res.model1,
alpha=0.05,
Letters=letters,
adjust="tukey")
I use the CLD data to create graphs
I manage to get the letters with the following code but then I am not getting the full anova table.
tx <- with(dt, interaction(irrigation, genotype)) # determining the factors
model2 <- lapply(varlist, function(x) {
lm(substitute(i~block+tx, list(i = as.name(x))), data = dt)}) # using the factors already in "tx"
lapply(model2, anova)
letters = lapply(model2, function(m) HSD.test((m), "tx", alpha = 0.05, group = TRUE, console = TRUE))
Any suggestions to achieve what I need.
Thank you

Modelling and fitting bi-modal lognormal distributions in a loop using lmfit

I have been spending FAR too much time trying to figure this out - so time to seek help. I am attempting to use lmfit to fit two lognormals (a and c) as well as the sum of these two lognormals (a+c) to a size distribution. Mode a centers around x=0.2, y=1, mode c centers around x=1.2, y=<<<1. There are numerous size distributions (>200) which are all slightly different and are passed in to the following code from an outside loop. For this example, I have provided a real life distribution and have not included the loop. Hopefully my code is sufficiently annotated to allow understanding of what I am trying to achieve.
I must be missing some fundamental understanding of lmfit (spoiler alert - I'm not great at Maths either) as I have 2 problems:
the fits (a, c and a+c) do not accurately represent the data. Note how the fit (red solid line) diverts away from the data (blue solid line). I assume this is something to do with the initial guess parameters. I have tried LOTS and have been unable to get a good fit.
re-running the model with "new" best fit values (results2, results3) doesn't appear to significantly improve the fit at all. Why?
Example result using provided x and y data:
Here is one-I-made-earlier showing the type of fit I am after (produced using the older mpfit module, using different data than provided below and using unique initial best guess parameters (not in a loop). Excuse the legend format, I had to remove certain information):
Any assistance is much appreciated. Here is the code with an example distribution:
from lmfit import models
import matplotlib.pyplot as plt
import numpy as np
# real life data example
y = np.array([1.000000, 0.754712, 0.610303, 0.527856, 0.412125, 0.329689, 0.255756, 0.184424, 0.136819,
0.102316, 0.078763, 0.060896, 0.047118, 0.020297, 0.007714, 0.010202, 0.008710, 0.005579,
0.004644, 0.004043, 0.002618, 0.001194, 0.001263, 0.001043, 0.000584, 0.000330, 0.000179,
0.000117, 0.000050, 0.000035, 0.000017, 0.000007])
x = np.array([0.124980, 0.130042, 0.135712, 0.141490, 0.147659, 0.154711, 0.162421, 0.170855, 0.180262,
0.191324, 0.203064, 0.215738, 0.232411, 0.261810, 0.320252, 0.360761, 0.448802, 0.482528,
0.525526, 0.581518, 0.658988, 0.870114, 1.001815, 1.238899, 1.341285, 1.535134, 1.691963,
1.973359, 2.285620, 2.572177, 2.900414, 3.342739])
# create the joint model using prefixes for each mode
model = (models.LognormalModel(prefix='p1_') +
models.LognormalModel(prefix='p2_'))
# add some best guesses for the model parameters
params = model.make_params(p1_center=0.1, p1_sigma=2, p1_amplitude=1,
p2_center=1, p2_sigma=2, p2_amplitude=0.000000000000001)
# bound those best guesses
# params['p1_amplitude'].min = 0.0
# params['p1_amplitude'].max = 1e5
# params['p1_sigma'].min = 1.01
# params['p1_sigma'].max = 5
# params['p1_center'].min = 0.01
# params['p1_center'].max = 1.0
#
# params['p2_amplitude'].min = 0.0
# params['p2_amplitude'].max = 1
# params['p2_sigma'].min = 1.01
# params['p2_sigma'].max = 10
# params['p2_center'].min = 1.0
# params['p2_center'].max = 3
# actually fit the model
result = model.fit(y, params, x=x)
# ====================================
# ================================
# re-run using the best-fit params derived above
params2 = model.make_params(p1_center=result.best_values['p1_center'], p1_sigma=result.best_values['p1_sigma'],
p1_amplitude=result.best_values['p1_amplitude'],
p2_center=result.best_values['p2_center'], p2_sigma=result.best_values['p2_sigma'],
p2_amplitude=result.best_values['p2_amplitude'], )
# re-fit the model
result2 = model.fit(y, params2, x=x)
# ================================
# re-run again using the best-fit params derived above
params3 = model.make_params(p1_center=result2.best_values['p1_center'], p1_sigma=result2.best_values['p1_sigma'],
p1_amplitude=result2.best_values['p1_amplitude'],
p2_center=result2.best_values['p2_center'], p2_sigma=result2.best_values['p2_sigma'],
p2_amplitude=result2.best_values['p2_amplitude'], )
# re-fit the model
result3 = model.fit(y, params3, x=x)
# ================================
# add individual fine and coarse modes using the revised fit parameters
model_a = models.LognormalModel()
params_a = model_a.make_params(center=result3.best_values['p1_center'], sigma=result3.best_values['p1_sigma'],
amplitude=result3.best_values['p1_amplitude'])
result_a = model_a.fit(y, params_a, x=x)
model_c = models.LognormalModel()
params_c = model_c.make_params(center=result3.best_values['p2_center'], sigma=result3.best_values['p2_sigma'],
amplitude=result3.best_values['p2_amplitude'])
result_c = model_c.fit(y, params_c, x=x)
# ====================================
plt.plot(x, y, 'b-', label='data')
plt.plot(x, result.best_fit, 'r-', label='best_fit_1')
plt.plot(x, result.init_fit, 'lightgrey', ls=':', label='ini_fit_1')
plt.plot(x, result2.best_fit, 'r--', label='best_fit_2')
plt.plot(x, result2.init_fit, 'lightgrey', ls='--', label='ini_fit_2')
plt.plot(x, result3.best_fit, 'r.-', label='best_fit_3')
plt.plot(x, result3.init_fit, 'lightgrey', ls='--', label='ini_fit_3')
plt.plot(x, result_a.best_fit, 'grey', ls=':', label='best_fit_a')
plt.plot(x, result_c.best_fit, 'grey', ls='--', label='best_fit_c')
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()

There are three main pieces of advice I can give:
initial values matter and should not be so far off as to make
portions of the model completely insensitive to the parameter
values. Your initial model is sort of off by several orders of
magnitude.
always look at the fit result. This is the primary
result -- the plot of the fit is a representation of the actual
numerical results. Not showing that you printed out the fit
report is a good indication that you did not look at the actual
result. Really, always look at the results.
if you are judging the quality of the fit based on a plot of
the data and model, use how you choose to plot the data to guide
how you fit the data. Specifically in your case, if you are
plotting on a log scale, then fit the log of the data to the log
of the model: fit in "log space".
Such a fit might look like this:
from lmfit import models, Model
from lmfit.lineshapes import lognormal
import matplotlib.pyplot as plt
import numpy as np
y = np.array([1.000000, 0.754712, 0.610303, 0.527856, 0.412125, 0.329689, 0.255756, 0.184424, 0.136819,
0.102316, 0.078763, 0.060896, 0.047118, 0.020297, 0.007714, 0.010202, 0.008710, 0.005579,
0.004644, 0.004043, 0.002618, 0.001194, 0.001263, 0.001043, 0.000584, 0.000330, 0.000179,
0.000117, 0.000050, 0.000035, 0.000017, 0.000007])
x = np.array([0.124980, 0.130042, 0.135712, 0.141490, 0.147659, 0.154711, 0.162421, 0.170855, 0.180262,
0.191324, 0.203064, 0.215738, 0.232411, 0.261810, 0.320252, 0.360761, 0.448802, 0.482528,
0.525526, 0.581518, 0.658988, 0.870114, 1.001815, 1.238899, 1.341285, 1.535134, 1.691963,
1.973359, 2.285620, 2.572177, 2.900414, 3.342739])
# use a model that is the log of the sum of two log-normal functions
# note to be careful about log(x) for x < 0.
def log_lognormal(x, amp1, cen1, sig1, amp2, cen2, sig2):
comp1 = lognormal(x, amp1, cen1, sig1)
comp2 = lognormal(x, amp2, cen2, sig2)
total = comp1 + comp2
total[np.where(total<1.e-99)] = 1.e-99
return np.log(comp1+comp2)
model = Model(log_lognormal)
params = model.make_params(amp1=5.0, cen1=-4, sig1=1,
amp2=0.1, cen2=-1, sig2=1)
# part of making sure that the lognormals are strictly positive
params['amp1'].min = 0
params['amp2'].min = 0
result = model.fit(np.log(y), params, x=x)
print(result.fit_report()) # <-- HERE IS WHERE THE RESULTS ARE!!
# also, make a plot of data and fit
plt.plot(x, y, 'b-', label='data')
plt.plot(x, np.exp(result.best_fit), 'r-', label='best_fit')
plt.plot(x, np.exp(result.init_fit), 'grey', label='ini_fit')
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()
This will print out
[[Model]]
Model(log_lognormal)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 211
# data points = 32
# variables = 6
chi-square = 0.91190970
reduced chi-square = 0.03507345
Akaike info crit = -101.854407
Bayesian info crit = -93.0599914
[[Variables]]
amp1: 21.3565856 +/- 193.951379 (908.16%) (init = 5)
cen1: -4.40637490 +/- 3.81299642 (86.53%) (init = -4)
sig1: 0.77286862 +/- 0.55925566 (72.36%) (init = 1)
amp2: 0.00401804 +/- 7.5833e-04 (18.87%) (init = 0.1)
cen2: -0.74055538 +/- 0.13043827 (17.61%) (init = -1)
sig2: 0.64346873 +/- 0.04102122 (6.38%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
C(amp1, cen1) = -0.999
C(cen1, sig1) = -0.999
C(amp1, sig1) = 0.997
C(cen2, sig2) = -0.964
C(amp2, cen2) = -0.940
C(amp2, sig2) = 0.849
C(sig1, amp2) = -0.758
C(cen1, amp2) = 0.740
C(amp1, amp2) = -0.726
C(sig1, cen2) = 0.687
C(cen1, cen2) = -0.669
C(amp1, cen2) = 0.655
C(sig1, sig2) = -0.598
C(cen1, sig2) = 0.581
C(amp1, sig2) = -0.567
and generate a plot like

Export command history in Rstudio

I am trying to export all the command history from Rstudio in Mac. But in the "History" tab or using savehistory(file = ".Rhistory"), only about 500 hundred latest commands are exported. However, when I directly search commands, I can see the commands way earlier than these 500 commands, for example half of a year ago. So where does Rstudio store all these histories? How can I export them? If the final history contains the date of the commands that I input, it would be even better. Thank you a lot.

After searching with the same question, here's an answer in case anyone else comes across the same issue:
savehistory is a base R function that will save everything from the current session, whereas RStudio has a seperate database that the history window uses. The location of this depends on your operating system (more info here).
Windows:
%localappdata%\RStudio
macOS:
open ~/.local/share/rstudio
Once you've located the history_database, this helpful code by #vzemlys allows you to export the contents to a text file with "nicer formatting" (i.e. data and time stamps for each line of code history):
library(dplyr)
library(magrittr)
library(lubridate)
library(bit64)
library(stringr)
lns <- readLines("yourfilelocation/history_database") %>% str_split(pattern=":",n=2)
hist_db <- data_frame(epoch=as.integer64(sapply(lns,"[[",1)),history=sapply(lns,"[[",2))
hist_db %<>% mutate(nice_date = as.POSIXct(epoch/1000,origin = "1970-01-01",tz = "EET"))
hist_db %<>% mutate(day = ceiling_date(nice_date,unit = "day")-days(1))
hist_db %<>% dplyr::select(-epoch)
dd <- hist_db$day %>% unique %>% sort
ff <- "hist_nice.txt"
cat("R history","\n",rep("-",80),"\n",file=ff,sep="")
for(i in 1:length(dd)) {
cat("\n\n",format(dd[i]),"\n",rep("-",80),"\n",file=ff,sep="",append=TRUE)
hist_db %>% filter(day==dd[i]) %>% dplyr::select(nice_date,history) %>% arrange(nice_date) %>%
write.table(ff,sep="\t", quote=F, row.names=FALSE, col.names=FALSE, append=TRUE)
}
Example output:
--------------------------------------------------------------------------------
R history
--------------------------------------------------------------------------------
2022-06-08
--------------------------------------------------------------------------------
2022-06-08 13:32:57 test <- testfile %>% pivot_wider(names_from=scenario, values_from=sst) %>% na.omit()
2022-06-08 13:32:57 ggplot() + theme_bw() +
2022-06-08 13:32:57 geom_tile(data = test, aes(x,y,colour=layer), alpha=0.8) +
2022-06-08 13:32:57 scale_color_viridis() +
2022-06-08 13:32:57 coord_equal() +
2022-06-08 13:32:57 theme(legend.position="bottom")
--------------------------------------------------------------------------------

R: how to properly create rx_forest_model object?

I'm trying to do a churn analysis with R and SQL Server 2016.
I have uploaded my dataset on my database in a local SQL Server and I did all the preliminary work on this dataset.
Well, now I have this function trainModel() which I would use to estimate my random model forest:
trainModel = function(sqlSettings, trainTable) {
sqlConnString = sqlSettings$connString
trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
table = trainTable,
colInfo = cdrColInfo)
## Create training formula
labelVar = "churn"
trainVars <- rxGetVarNames(trainDataSQL)
trainVars <- trainVars[!trainVars %in% c(labelVar)]
temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
formula <- as.formula(temp)
## Train gradient tree boosting with mxFastTree on SQL data source
library(RevoScaleR)
rx_forest_model <- rxDForest(formula = formula,
data = trainDataSQL,
nTree = 8,
maxDepth = 16,
mTry = 2,
minBucket = 1,
replace = TRUE,
importance = TRUE,
seed = 8,
parms = list(loss = c(0, 4, 1, 0)))
return(rx_forest_model)
}
But when I run the function I get this wrong output:
> system.time({
+ trainModel(sqlSettings, trainTable)
+ })
user system elapsed
0.29 0.07 58.18
Warning message:
In tempGetNumObs(numObs) :
Number of observations not available for this data source. 'numObs' set to 1e6.
And for this warning message, the function trainModel() does not create the object rx_forest_model
Does anyone have any suggestions on how to solve this problem?

After several attempts, I found the reason why the function trainModel() did not function properly.
Is not a connection string problem and is not even a data source type issue.
The problem is in the syntax of function trainModel().
It is enough to eliminate from the body of the function the statement:
return(rx_forest_model)
In this way, the function returns the same warning message, but creates the object rx_forest_model in the correct way.
So, the correct function is:
trainModel = function(sqlSettings, trainTable) {
sqlConnString = sqlSettings$connString
trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
table = trainTable,
colInfo = cdrColInfo)
## Create training formula
labelVar = "churn"
trainVars <- rxGetVarNames(trainDataSQL)
trainVars <- trainVars[!trainVars %in% c(labelVar)]
temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
formula <- as.formula(temp)
## Train gradient tree boosting with mxFastTree on SQL data source
library(RevoScaleR)
rx_forest_model <- rxDForest(formula = formula,
data = trainDataSQL,
nTree = 8,
maxDepth = 16,
mTry = 2,
minBucket = 1,
replace = TRUE,
importance = TRUE,
seed = 8,
parms = list(loss = c(0, 4, 1, 0)))
}

gWidgets return multiple responses for multiple columns

I posted before on this and was not able to arrive at a solution with the single posted response. I have tried another approach that gets me what I want, mostly. The current solution would work better if it were in a single dialogue box but I am not sure how to do this. I am putting together other solutions I found on Nabble and stack and just cannot get what I need. I want the following to offer the choices in a single window, instead of having the user input into 3 separate boxes. Please help.
options(guiToolkit="RGtk2")
library(gWidgets)
#Creat selection function
select <- function(x,multiple = TRUE, Title,...){
ans<-new.env()
x1<-ggroup(horizontal=TRUE) # no parent container here
x2<-gcheckboxgroup( x,multiple=multiple,con=x1,expand=TRUE)
ret <- gbasicdialog(title = Title, widget=x1,handler=function(h,...){
value <- svalue(x2)
if (length(value)==0) value=""
assign("selected",value,env=h$action$env)
dispose(x1)
},action=list(env=ans))
ans
}
#Create list to store results
Days. <- c("Day1","Day2","Day3")
Outputs_ <- list()
SelectionOptions. <- c("Bicycle Event1", "Construction Nearby","Path Closure")
#Cycle through each day
for(day in Days.){
ans <- select(SelectionOptions., Title = day)
Outputs_[[day]] <- ans$selected
}
#return results of selection
unlist(Outputs_)

You are making it a bit more complicated than need be.
you are better off making things asynchronous (in most cases), using a handler to be called when the values are selected.
However, if you want it done this way, make a container for your selection widgets, then add them to the container. This container is then passed into gbasicdialog. That way all will show in one window. Using gWidgets2 makes working with the basic dialog a lot easier.

Okay, here is an example, though I'm not quite sure what problem you are trying to solve, so may be off:
library(gWidgets2)
Days. <- c("Day1","Day2","Day3")
SelectionOptions. <- c("Bicycle Event1", "Construction Nearby","Path Closure")
w <- gwindow()
g <- gformlayout(cont=w)
days <- gradio(Days., cont=g, label="Days:")
opts <- gcombobox(SelectionOptions., cont=g, label="Options:")
btn <- gbutton("Show them", cont=g, label="")
addHandlerClicked(btn, handler=function(h,...) {
values <- lapply(list(days, opts), svalue)
print(values)
})
## or using a modal dialog
dlg <- gbasicdialog(handler=function(h,...) {
values <- lapply(list(days, opts), svalue)
print(values)
})
g <- gformlayout(cont=dlg)
days <- gradio(Days., cont=g, label="Days:")
opts <- gcombobox(SelectionOptions., cont=g, label="Options:")
visible(dlg, TRUE)
## third time is a charm? (Though you could clean this up)
library(gWidgets2)
e <- new.env()
e$Days <- ""; e$Sel <- ""
Days. <- c("Day1","Day2","Day3")
SelectionOptions. <- c("Bicycle Event1", "Construction Nearby","Path Closure")
dlg <- gbasicdialog(title="test", do.buttons=FALSE)
g <- ggroup(cont=dlg, horizontal=FALSE)
fl <- gformlayout(cont=g)
days <- gradio(Days., cont=fl, label="Days:")
cb <- gcombobox(SelectionOptions., selected=0, cont=fl, label="Options:")
handler <- function(h,...) {
e$Days <- svalue(days)
e$Sel <- ifelse(length(svalue(cb)) > 0, svalue(cb), "")
if (all(sapply(e, function(x) nchar(x) > 0)))
dlg$dispose_window()
}
sapply(list(days, cb), addHandlerChanged, handler=handler)
visible(dlg, TRUE)
Take 4
library(gWidgets2)
days <- paste("Day", 1:3)
events <- c("none", "Bicycle Event","Construction Nearby","Path Closure")
l <- list()
handler <- function(h,...) {
l[[h$action]] <<- svalue(h$obj)
if(length(l) == 3)
print(l)
}
w <- gwindow()
g <- gvbox(cont=w)
for (day in days) {
l <- glabel(day, cont=g); font(l) <- list(weight="bold")
gradio(events, cont=g, handler=handler, action=day)
}