Wrong outputs from torch.sub? - arrays

I’m currently using torch.sub alongside torch.div to obtain the MAPE between my predicted and true labels for my neural network although I’m not getting the answers I’m expecting. According to the example in the documentation, I should be getting a 4x1 tensor, not 4x4.
Could anyone clear this up for me?
print('y_true ', y_true)
y_true tensor([[ 46],
[262],
[ 33],
[ 35]], device=‘cuda:0’, dtype=torch.int16)
print('y_pred ', y_pred)
y_pred tensor([[[308.5075]],
[[375.8983]],
[[389.4587]],
[[406.4957]]], device=‘cuda:0’, grad_fn=)
print('torch.sub ', torch.sub(y_true, y_pred))
torch.sub tensor([[[-262.5075],
[ -46.5075],
[-275.5075],
[-273.5075]],
[[-329.8983],
[-113.8983],
[-342.8983],
[-340.8983]],
[[-343.4587],
[-127.4587],
[-356.4587],
[-354.4587]],
[[-360.4957],
[-144.4957],
[-373.4957],
[-371.4957]]], device='cuda:0', grad_fn=<SubBackward0>)

That is because y_pred has an extra dimension which means the y_true tensor
probably gets broadcasted to the correct dimension.
If you remove the extra last dimension you get the desired result:
>>> torch.sub(y_true, y_pred[...,0]).shape
torch.Size([4, 1])

Related

Lasso via GridSearchCV: ConvergenceWarning: Objective did not converge

I am trying to find the optimal parameter of a Lasso regression:
alpha_tune = {'alpha': np.linspace(start=0.000005, stop=0.02, num=200)}
model_tuner = Lasso(fit_intercept=True)
cross_validation = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model = GridSearchCV(estimator=model_tuner, param_grid=alpha_tune, cv=cross_validation, scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train_std, labels_train)
print(model.best_params_['alpha'])
My variables are demeaned and standardized. But I get the following error:
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.279e+02, tolerance: 6.395e-01
I know this error has been reported several times, but none of the previous posts answer how to solve it. In my case, the error is generated by the fact that the lowerbound 0.000005 is very small, but this is a reasonable value as indicated by solving the tuning problem via the information criteria:
lasso_aic = LassoLarsIC(criterion='aic', fit_intercept=True, eps=1e-16, normalize=False)
lasso_aic.fit(X_train_std, y_train)
print('Lambda: {:.8f}'.format(lasso_aic.alpha_))
lasso_bic = LassoLarsIC(criterion='bic', fit_intercept=True, eps=1e-16, normalize=False)
lasso_bic.fit(X_train_std, y_train)
print('Lambda: {:.8f}'.format(lasso_bic.alpha_))
AIC and BIC give values of around 0.000008. How can this warning be solved?
Increasing the default parameter max_iter=1000 in Lasso will do the job:
alpha_tune = {'alpha': np.linspace(start=0.000005, stop=0.02, num=200)}
model_tuner = Lasso(fit_intercept=True, max_iter=5000)
cross_validation = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model = GridSearchCV(estimator=model_tuner, param_grid=alpha_tune, cv=cross_validation, scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train_std, labels_train)
print(model.best_params_['alpha'])

How can I get list values that meet specific conditions?

I'm an introductory member of Python.
I are implementing code to organize data with Python.
I have to extract a value that meets only certain conditions out of the numerous lists.
It seems very simple, but it feels too difficult for me.
First, let me explain with the simplest example
solutions
Out[73]:
array([[ 2.31350622e-04, -1.42539948e-02, -7.17361833e-02,
2.17545418e-01, -3.38251827e-01, 1.88254191e-01],
[ 4.23523963e-82, -9.48255372e-81, 5.22018863e-80,
-1.11271010e-79, 1.03507672e-79, -3.55573390e-80],
[ 2.31350597e-04, -1.42539951e-02, -7.17361800e-02,
2.17545409e-01, -3.38251817e-01, 1.88254187e-01],
[ 2.58309722e-02, -6.21550000e-01, 3.41867505e+00,
-7.53828444e+00, 7.09091365e+00, -2.39409614e+00],
[ 2.31350606e-04, -1.42539950e-02, -7.17361809e-02,
2.17545411e-01, -3.38251820e-01, 1.88254188e-01],
[ 1.14525725e-02, -3.25174709e-01, 2.11632584e+00,
-5.16113713e+00, 5.12508331e+00, -1.78380602e+00],
[ 9.75839726e-03, -3.08729919e-01, 2.26983591e+00,
-6.16462170e+00, 6.76409438e+00, -2.55992476e+00],
[ 1.13190092e-03, -6.72042220e-02, 7.10413638e-01,
-2.39952623e+00, 2.94849402e+00, -1.18046338e+00],
[ 5.24406689e-03, -1.86240596e-01, 1.36500589e+00,
-3.61106144e+00, 3.75606312e+00, -1.34699295e+00]])
coeff
Out[74]:
array([[ 1.03177808e-04, -6.35700011e-03, -3.19929208e-02,
9.70209594e-02, -1.50853634e-01, 8.39576506e-02,
4.45980248e-01],
[ 5.13911499e-83, -1.15062991e-81, 6.33426960e-81,
-1.35018220e-80, 1.25598048e-80, -4.31459067e-81,
1.21341776e-01],
[ 1.03177797e-04, -6.35700027e-03, -3.19929194e-02,
9.70209556e-02, -1.50853630e-01, 8.39576490e-02,
4.45980249e-01],
[ 4.26209161e-03, -1.02555298e-01, 5.64078896e-01,
-1.24381145e+00, 1.16999559e+00, -3.95024121e-01,
1.64999272e-01],
[ 1.03177801e-04, -6.35700023e-03, -3.19929198e-02,
9.70209566e-02, -1.50853631e-01, 8.39576495e-02,
4.45980248e-01],
[ 2.27512838e-03, -6.45980810e-02, 4.20421959e-01,
-1.02529362e+00, 1.01813129e+00, -3.54364724e-01,
1.98656535e-01],
[ 1.42058482e-03, -4.49435521e-02, 3.30432790e-01,
-8.97418681e-01, 9.84687293e-01, -3.72662657e-01,
1.45575629e-01],
[ 2.46722650e-04, -1.46486353e-02, 1.54850246e-01,
-5.23029411e-01, 6.42688990e-01, -2.57307904e-01,
2.17971950e-01],
[ 1.30617191e-03, -4.63880878e-02, 3.39990392e-01,
-8.99429225e-01, 9.35545685e-01, -3.35503798e-01,
2.49076135e-01]])
In a matrix defined as 'numpy', called 'solutions', each row represents 'solutions[0]','solutions[1]', 'solutions[i]'... In addition, the 'coeff' is also defined as 'numpy', and the 'coeff[0]','coeff[1]','coeff[i]'... is matched to 'solutions[0]','solutions[1]','solutions[i]'...
What I want at this time is to find specific 'solution[i]' and 'coeff[i]' where all elements of solutions[i] are less than 10^-10 and all elements of coeff[i] are greater than 10^-3.
I wonder if there is an appropriate code to extract a list array in a situation that meets more than one condition. I'm a Python initiator, so please excuse me.
This can be accomplished using advanced indexing:
solution_valid = np.all(solutions < 10e-10, axis=1)
coeff_valid = np.all(coeff > 1e-3, axis=1)
both_valid = coeff_valid & solution_valid
valid_solutions = solutions[both_valid]
valid_coeffs = coeff[both_valid]
but perhaps you mean that the absolute value should be greater or below a certain threshold?
solution_valid = np.all(np.abs(solutions) < 10e-10, axis=1)
coeff_valid = np.all(np.abs(coeff) > 1e-3, axis=1)
both_valid = coeff_valid & solution_valid
valid_solutions = solutions[both_valid]
valid_coeffs = coeff[both_valid]

using lookup tables to plot a ggplot and table

I'm creating a shiny app and i'm letting the user choose what data that should be displayed in a plot and a table. This choice is done through 3 different input variables that contain 14, 4 and two choices respectivly.
ui <- dashboardPage(
dashboardHeader(),
dashboardSidebar(
selectInput(inputId = "DataSource", label = "Data source", choices =
c("Restoration plots", "all semi natural grasslands")),
selectInput(inputId = "Variabel", label = "Variable", choices =
choicesVariables)),
#choicesVariables definition is omitted here, because it's very long but it
#contains 14 string values
selectInput(inputId = "Factor", label = "Factor", choices = c("Company
type", "Region and type of application", "Approved or not approved
applications", "Age group" ))
),
dashboardBody(
plotOutput("thePlot"),
tableOutput("theTable")
))
This adds up to 73 choices (yes, i know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table so a created one with every valid combination of choices like this:
rad1<-c(rep("Company type",20), rep("Region and type of application",20),
rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2<-choicesVariable[c(1:14,1,4,5,9,10,11, 1:14,1,4,5,9,10,11, 1:7,9:14,
1:14,1,4,5,9,10,11)]
rad3<-c(rep("Restoration plots",14),rep("all semi natural grasslands",6),
rep("Restoration plots",14), rep("all semi natural grasslands",6),
rep("Restoration plots",27), rep("all semi natural grasslands",6))
rad4<-1:73
letaLista<-data.frame(rad1,rad2,rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now its easy to use subset to only get the choice that the user made. But how do i use this information to plot the plot and table without using a 73 line long ifelse statment?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots) but i couldn't make it work. My experience with these kind of arrays is limited and this might be a simple issue, but any hints would be helpful!
My dataset that is the foundation for the plots and table consists of dataframe with 23 variables, factors and numerical. The plots and tabels are then created using the following code for all 73 combinations
s_A1 <- summarySE(Samlad_info, measurevar="Dist_brukcentrum",
groupvars="Companytype")
s_A1 <- s_A1[2:6,]
p_A1=ggplot(s_A1, aes(x=Companytype,
y=Dist_brukcentrum))+geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=Dist_brukcentrum-se,
ymax=Dist_brukcentrum+se),width=.2,position=position_dodge(.9))+
scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, burrowed from cookbook for R
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=TRUE,
conf.interval=.95, .drop=TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
The code in it's entirety is a bit to large but i hope this may clarify what i'm trying to do.
Well, thanks to florian's comment i think i might have found a solution my self. I'll present it here but leave the question open as there is probably far neater ways of doing it.
I rigged up the plots (that was created as lists by ggplot) into a list
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id of the list to select the right plot and table.
output$thePlot <-renderPlot({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
plotList[as.integer(plotValue[1,4])]
})
output$theTable <-renderTable({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
skriva <- tableList[as.integer(plotValue[4])]
print(skriva)
})

Ruby: Extract elements from deeply nested JSON structure based on criteria

Want to extract every marketID from every market that has a marketName == 'Moneyline'. Tried a few combinations of .maps, .rejects, and/or .selects but can't narrow it down as the complicated structure is confusing me.
There are many markets in events, and there are many events as well. A sample of the structure (tried to edit it for brevity):
{"currencyCode"=>"GBP",
"eventTypes"=>[
{"eventTypeId"=>6423,
"eventNodes"=>[
{"eventId"=>28017227,
"event"=>
{"eventName"=>"Philadelphia # Seattle"
},
"marketNodes"=>[
{"marketId"=>"1.128274650",
"description"=>
{"marketName"=>"Moneyline"}
},
{"marketId"=>"1.128274625",
"description"=>
{"marketName"=>"Winning Margin"}
}}}]},
{"eventId"=>28018251,
"event"=>
{"eventName"=>"Arkansas # Mississippi State"
},
"marketNodes"=>[
{"marketId"=>"1.128299882",
"description"=>
{"marketName"=>"Under/Over 60.5pts"}
},
{"marketId"=>"1.128299881",
"description"=>
{"marketName"=>"Moneyline"}
}}}]},
{"eventId"=> etc....
Tried all kinds of things, for example,
markets = json["eventTypes"].first["eventNodes"].map {|e| e["marketNodes"].map { |e| e["marketId"] } if (e["marketNodes"].map {|e| e["marketName"] == 'Moneyline'})}
markets.flatten
# => yields every marketId not every marketId with marketName of 'Moneyline'
Getting a simple array with every marketId from Moneyline markets with no other information is sufficient. Using Rails methods is fine too if preferred.
Sorry if my editing messed up the syntax. Here's the source. It looks like this only with => instead of : after parsing the JSON.
Thank you!
I love nested maps and selects :D
require 'json'
hash = JSON.parse(File.read('data.json'))
moneyline_market_ids = hash["eventTypes"].map{|type|
type["eventNodes"].map{|node|
node["marketNodes"].select{|market|
market["description"]["marketName"] == 'Moneyline'
}.map{|market| market["marketId"]}
}
}.flatten
puts moneyline_market_ids.join(', ')
#=> 1.128255531, 1.128272164, 1.128255516, 1.128272159, 1.128278718, 1.128272176, 1.128272174, 1.128272169, 1.128272148, 1.128272146, 1.128255464, 1.128255448, 1.128272157, 1.128272155, 1.128255499, 1.128272153, 1.128255484, 1.128272150, 1.128255748, 1.128272185, 1.128278720, 1.128272183, 1.128272178, 1.128255729, 1.128360712, 1.128255371, 1.128255433, 1.128255418, 1.128255403, 1.128255387
Just for fun, here's another possible answer, this time with regexen. It is shorter but might break depending on your input data. It reads the json data directly as String :
json = File.read('data.json')
market_ids = json.scan(/(?<="marketId":")[\d\.]+/)
market_names = json.scan(/(?<="marketName":")[^"]+/)
moneyline_market_ids = market_ids.zip(market_names).select{|id,name| name=="Moneyline"}.map{|id,_| id}
puts moneyline_market_ids.join(', ')
#=> 1.128255531, 1.128272164, 1.128255516, 1.128272159, 1.128278718, 1.128272176, 1.128272174, 1.128272169, 1.128272148, 1.128272146, 1.128255464, 1.128255448, 1.128272157, 1.128272155, 1.128255499, 1.128272153, 1.128255484, 1.128272150, 1.128255748, 1.128272185, 1.128278720, 1.128272183, 1.128272178, 1.128255729, 1.128360712, 1.128255371, 1.128255433, 1.128255418, 1.128255403, 1.128255387
It outputs the same result as the other answer.

Julia plotting unknown number of layers in Gadfly

I'm trying to create a plot in Julia (currently using Gadfly but I'd be willing to use a different package). I have a multidimensional array. For a fixed dimension size (e.g. 4875x3x3 an appropriate plot would be:
p=Gadfly.plot(
layer(y=sim1.value[:,1,1],x=[sim1.range],Geom.line, Theme(default_color=color("red"))),
layer(y=sim1.value[:,1,2],x=[sim1.range],Geom.line, Theme(default_color=color("blue"))),
layer(y=sim1.value[:,1,3],x=[sim1.range],Geom.line, Theme(default_color=color("green")))
)
but in general I want to be able to write a plot statement where I do not know the third dimension of the sim1.value array. How can I write such a statement?
perhaps something like:
p=Gadfly.plot([layer(y=sim1.value[:,1,i],x=[sim1.range], Geom.line, Theme(default_color=color("red"))) for i in 1:size(sim1)[3]])
but this doesn't work.
I was able to solve this problem by reshaping the array into a dataframe and adding a column to indicate what the third dimension is, but I was wondering if there was a way to do this without creating a dataframe.
Data look something like this:
julia> sim1.value
4875x3x3 Array{Float64,3}:
[:, :, 1] =
0.201974 0.881742 0.497407
0.0751914 0.921308 0.732588
-0.109084 1.06304 1.15962
-0.0149133 0.896267 1.22897
0.717094 0.72558 0.456043
0.971697 0.792255 0.40328
0.971697 0.792255 0.227884
-0.600564 1.23815 0.499631
-0.881391 1.07994 0.59905
-0.530923 1.00278 0.447363
⋮
0.866138 0.657875 0.280823
1.00881 0.594015 0.894645
0.470741 0.859117 1.09108
0.919887 0.540488 1.01126
2.22095 0.194968 0.954895
2.5013 0.202698 2.05665
1.94958 0.257192 2.01836
2.24015 0.209885 1.67657
0.76246 0.739945 2.2389
0.673887 0.640661 2.15134
[:, :, 2] =
1.28742 0.760712 1.61112
2.21436 0.229947 1.87528
-1.66456 1.46374 1.94794
-2.4864 1.84093 2.34668
-2.79278 1.61191 2.22896
-1.46289 1.21712 1.96906
-0.580682 1.3222 1.45223
0.17112 1.20572 0.74517
0.734113 0.629927 1.43462
1.29676 0.266065 1.52497
⋮
1.2871 0.595874 0.195617
1.84438 0.383567 1.15537
2.12446 0.520074 0.957211
2.36307 0.222486 0.402168
2.43727 0.19843 0.636037
2.33525 0.302378 0.811371
1.09497 0.605816 0.297978
1.366 0.56246 0.343701
1.366 0.56246 0.219561
1.35889 0.630971 0.281955
[:, :, 3] =
0.649675 0.899028 0.628103
0.718837 0.665043 0.153844
0.914646 0.807048 0.207743
0.612839 0.790611 0.293676
0.759457 0.758115 0.280334
0.77993 0.774677 0.396879
-1.63825 1.38275 0.85772
-1.43517 1.45871 0.835853
-1.15413 1.35757 1.05071
-1.10967 1.37525 0.685986
⋮
1.15299 0.561492 0.680718
1.14853 0.629728 0.294947
1.65147 0.517422 0.22285
1.65147 0.517422 0.517451
1.78835 0.719658 0.745866
2.36554 0.426616 1.49432
0.855502 0.739237 1.24224
-0.175234 0.701025 1.07798
-0.221313 0.939255 1.3463
1.58094 0.368615 1.63817
Apparently "splatting", if that's the correct term, works here. Try:
p=Gadfly.plot([layer(y=sim1.value[:,1,i],x=[sim1.range], Geom.line, Theme(default_color=color("red"))) for i in 1:size(sim1)[3]]...)
For different layer colors, this is just a guess/hack (feel free to edit for correctness).
p=Gadfly.plot([layer(y=sim1.value[:,1,i],x=[sim1.range], Geom.line, Theme(default_color=color(["red" "blue" "green" "cyan" "magenta" "yellow"][i%6+1]))) for i in 1:size(sim1)[3]]...)
Perhaps one of Gadfly's Scale color parameters would help here.
Addendum:
See first comment below for color selection method.

Resources