Accuracy 100% with grid search? What is wrong? - logistic-regression

When I run Logistic Regression with default parameters, accuracy is 98%.
After grid search, accuracy is 100%. Is that possible? Could this happen because of a bad dataset? Here is my code so far.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
sfk = StratifiedKFold(n_splits=5, random_state=21, shuffle=True)
lr = LogisticRegression()
lr_param_grid = {
'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
'penalty': ['l1', 'l2'],
'max_iter': list(range(100,800,100)),
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
lr_search = GridSearchCV(estimator=lr, param_grid=lr_param_grid,cv=sfk)
lr_search.fit(X_train, y_train)
lr_test_predicitons = lr_search.predict(X_test)
print(metrics.classification_report(y_test, lr_test_predicitons))
print('Mean Accuracy (test set): %.3f' % accuracy_score(y_test,lr_test_predicitons))
              precision    recall  f1-score   support
           0       1.00      1.00      1.00       155
           1       1.00      1.00      1.00        57
    accuracy                           1.00       212
   macro avg       1.00      1.00      1.00       212
weighted avg       1.00      1.00      1.00       212

The grid search algorithm finds the parameters that give the best accuracy. Moreover, it has a parameter that defaults to True, so this can happen, and if the dataset is very small it is very likely.
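To put the "very small dataset" point in numbers, here is a minimal sketch using only the standard library. It assumes the 98% figure was measured on the same 212-sample test set shown in the report, and the helper name `wilson_interval` is made up for this illustration. It counts how many test samples separate 98% from 100%, and computes an approximate 95% Wilson score interval for a perfect score on 212 samples.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

n_test = 212  # total support in the classification report above

# Assumption: the 98% accuracy was measured on the same 212 test samples.
errors_at_98 = round(n_test * (1 - 0.98))

# Interval for 212 correct predictions out of 212.
low, high = wilson_interval(n_test, n_test)

print("misclassified samples at 98%:", errors_at_98)
print("95%% interval for 212/212: %.3f to %.3f" % (low, high))
```

With so few test samples the gap between the two runs is only a handful of predictions, and the lower end of the interval for a perfect score sits at roughly 0.98, so the two results are hard to tell apart on this test set alone.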

Related

KNN Algorithm is Giving good Accuracy with Bad Confusion Matrix Results

I have data for multilabel classification, and I used a KNN model to classify it. There are 15 labels. I computed the accuracy for each label and averaged the results to get the accuracy of the model, which is 93%.
The confusion matrix, however, shows bad numbers.
Could you tell me what this means? Is it overfitting? How can I solve my problem?
Accuracy and mean absolute error (mae) code
Input:
# Getting the accuracy of the model
y_pred1 = level_1_knn_model.predict(X_val1)
accuracy = (sum(y_val1==y_pred1)/y_val1.shape[0])*100
accuracy = sum(accuracy)/len(accuracy)
print("Accuracy: "+str(accuracy)+"%\n")
# Getting the mean absolute error
mae1 = mean_absolute_error(y_val1, y_pred1)
print("Mean Absolute Error: "+str(mae1))
Output:
Accuracy: [96.55462575 97.82146336 99.23207908 95.39247451 98.69340807 74.22793801
78.67975909 97.47825108 99.80189098 77.67264969 91.69399776 99.97084683
99.42621267 99.32682688 99.74159693]%
Accuracy: 93.71426804569977%
Mean Absolute Error: 9.703818402273944
Confusion Matrix and classification report code
Input:
# Calculate the confusion matrix
cMatrix1 = confusion_matrix(y_val1.argmax(axis=1), y_pred1.argmax(axis=1))
# Plot the confusion matrix
plt.figure(figsize=(11,10))
sns.heatmap(cMatrix1, annot=True, fmt='g')
# Calculate the classification report
classReport1 = classification_report(y_val1, y_pred1)
print("\nClassification Report:")
print(classReport1)
Output:
Classification Report:
              precision    recall  f1-score   support
           0       0.08      0.00      0.01      5053
           1       0.03      0.00      0.01      3017
           2       0.00      0.00      0.00      1159
           3       0.07      0.00      0.01      6644
           4       0.00      0.00      0.00      1971
           5       0.58      0.65      0.61     47222
           6       0.39      0.33      0.36     27302
           7       0.02      0.00      0.00      3767
           8       0.00      0.00      0.00       299
           9       0.58      0.61      0.60     40823
          10       0.13      0.02      0.03     11354
          11       0.00      0.00      0.00        44
          12       0.00      0.00      0.00       866
          13       0.00      0.00      0.00      1016
          14       0.00      0.00      0.00       390
   micro avg       0.54      0.43      0.48    150927
   macro avg       0.13      0.11      0.11    150927
weighted avg       0.43      0.43      0.42    150927
 samples avg       0.43      0.43      0.43    150927

How do I correctly add a chain ID to my pdb file?

I am trying to conduct some analysis with my single-chain PDB file (766 residues long), but it requires a chain ID. Currently, there isn't one.
Here is a snippet of the pdb file:
ATOM 1 N MET 1 -69.269 78.953 -91.441 1.00 0.00 N
ATOM 2 CA MET 1 -69.264 78.650 -92.891 1.00 0.00 C
ATOM 4 C MET 1 -69.371 79.939 -93.633 1.00 0.00 C
ATOM 5 O MET 1 -68.379 80.649 -93.799 1.00 0.00 O
ATOM 3 CB MET 1 -70.475 77.774 -93.251 1.00 0.00 C
ATOM 6 CG MET 1 -70.505 76.455 -92.477 1.00 0.00 C
ATOM 7 SD MET 1 -69.115 75.332 -92.806 1.00 0.00 S
ATOM 8 CE MET 1 -69.473 74.270 -91.377 1.00 0.00 C
ATOM 9 N ASP 2 -70.583 80.284 -94.111 1.00 0.00 N
ATOM 10 CA ASP 2 -70.688 81.539 -94.789 1.00 0.00 C
ATOM 12 C ASP 2 -70.661 82.602 -93.737 1.00 0.00 C
ATOM 13 O ASP 2 -71.088 82.377 -92.606 1.00 0.00 O
ATOM 11 CB ASP 2 -71.963 81.733 -95.626 1.00 0.00 C
ATOM 14 CG ASP 2 -71.691 82.908 -96.557 1.00 0.00 C
ATOM 15 OD1 ASP 2 -70.569 82.953 -97.130 1.00 0.00 O
ATOM 16 OD2 ASP 2 -72.598 83.768 -96.717 1.00 0.00 O1-
ATOM 17 N HIS 3 -70.129 83.791 -94.077 1.00 0.00 N
ATOM 18 CA HIS 3 -70.045 84.846 -93.110 1.00 0.00 C
ATOM 20 C HIS 3 -71.342 85.581 -93.094 1.00 0.00 C
ATOM 21 O HIS 3 -72.113 85.574 -94.052 1.00 0.00 O
ATOM 19 CB HIS 3 -68.925 85.865 -93.404 1.00 0.00 C
ATOM 23 CG HIS 3 -68.749 86.908 -92.336 1.00 0.00 C
ATOM 25 CD2 HIS 3 -67.998 86.879 -91.200 1.00 0.00 C
ATOM 22 ND1 HIS 3 -69.357 88.144 -92.351 1.00 0.00 N
ATOM 26 CE1 HIS 3 -68.947 88.797 -91.234 1.00 0.00 C
ATOM 24 NE2 HIS 3 -68.121 88.068 -90.504 1.00 0.00 N
What's the best way for me to label the chain as chain A?
Here's the answer.
We need to read the file line by line and put a chain ID into column 22 of each line that begins with ATOM. Assuming the file is called myfile.pdb, we want to replace the blank that sits 17 characters after ATOM (that is, column 22) with the letter A. This can be accomplished with a relatively simple sed command.
sed 's/^\(ATOM.\{17\}\) /\1A/' myfile.pdb > newfile.pdb
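If sed is not available, the same column-22 replacement can be sketched in Python. This is only an illustration of the idea above: the function names `add_chain_id` and `add_chain_to_file` are made up here, and the example line is the first ATOM record from the question re-spaced into fixed-width PDB columns, which is an assumption because the snippet in the question has lost its original spacing.

```python
def add_chain_id(line, chain="A"):
    """Write `chain` into column 22 of an ATOM record if that column is blank."""
    # Column 22 is index 21 in Python's 0-based indexing.
    if line.startswith("ATOM") and len(line) > 21 and line[21] == " ":
        return line[:21] + chain + line[22:]
    return line  # other records, or records that already have a chain ID

def add_chain_to_file(src_path, dst_path, chain="A"):
    """Copy src_path to dst_path, labelling every blank-chain ATOM record."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(add_chain_id(line, chain))

# Assumed fixed-width layout of the first record from the question.
example = "ATOM      1  N   MET     1     -69.269  78.953 -91.441  1.00  0.00           N"
labelled = add_chain_id(example)
print(labelled)
```

Calling `add_chain_to_file("myfile.pdb", "newfile.pdb")` would then mirror the sed command. Like the sed version, this leaves lines untouched when column 22 is already occupied.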
Hope this is helpful!

SAS: Using a Loop for Creating Many Data Sets and renaming the variables in them

I have a dataset in a long format as e.g.:
time subject var1 var2 var3
1 1 0.41 0.48 0.85
2 1 0.58 0.38 0.15
3 1 0.08 0.39 0.96
4 1 0.58 0.87 0.15
5 1 0.55 0.40 0.67
1 2 0.76 0.49 0.03
2 2 0.36 0.26 0.93
3 2 0.83 0.88 0.63
4 2 0.19 0.65 0.99
5 2 0.89 0.91 0.47
I would like to get a dataset in a wide format as
time var1_sub1 var2_sub1 var3_sub1 var1_sub2 var2_sub2 var3_sub2
1 0.41 0.48 0.85 0.76 0.49 0.03
2 0.58 0.38 0.15 0.36 0.26 0.93
3 0.08 0.39 0.96 0.83 0.88 0.63
4 0.58 0.87 0.15 0.19 0.65 0.99
5 0.55 0.40 0.67 0.89 0.91 0.47
So far, I came up with an idea to do it in the following way:
data data_sub1;
set data;
if subject=1;
var1_sub1=var1;
var2_sub1=var2;
var3_sub1=var3;
run;
data data_sub2;
set data;
if subject=2;
var1_sub2=var1;
var2_sub2=var2;
var3_sub2=var3;
run;
proc sort data=data_sub1;
by time;
run;
proc sort data=data_sub2;
by time;
run;
data datamerged;
merge data_sub1 data_sub2;
by time;
run;
It works and everything is fine, but I would like to learn how to code it in a more elegant way, since in practice I have many more subjects and variables.
This is a PROC TRANSPOSE problem. To solve most PROC TRANSPOSE problems, make it totally vertical (one value-one variable name per row) and then transpose using the ID statement.
data have;
input time subject var1 var2 var3;
datalines;
1 1 0.41 0.48 0.85
2 1 0.58 0.38 0.15
3 1 0.08 0.39 0.96
4 1 0.58 0.87 0.15
5 1 0.55 0.40 0.67
1 2 0.76 0.49 0.03
2 2 0.36 0.26 0.93
3 2 0.83 0.88 0.63
4 2 0.19 0.65 0.99
5 2 0.89 0.91 0.47
;;;;
run;
data have_vert;
set have;
array vars var:;
do _t = 1 to dim(vars);
id=cats(vname(vars[_t]),'_','sub',subject); *this is our future variable name;
value = vars[_t]; *this is our future variable value;
output;
end;
keep time id value subject;
run;
proc sort data=have_vert;
by time subject id;
run;
proc transpose data=have_vert out=want;
by time;
var value;
id id;
run;
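For readers who want to see the two steps (make the data fully vertical, then transpose by an ID) outside SAS, here is a rough Python sketch using only built-in data structures. It is not a translation of PROC TRANSPOSE, just the same idea applied to the sample data from the question; the names `rows`, `vertical` and `wide` are chosen for this illustration.

```python
# (time, subject, var1, var2, var3), copied from the question
rows = [
    (1, 1, 0.41, 0.48, 0.85),
    (2, 1, 0.58, 0.38, 0.15),
    (3, 1, 0.08, 0.39, 0.96),
    (4, 1, 0.58, 0.87, 0.15),
    (5, 1, 0.55, 0.40, 0.67),
    (1, 2, 0.76, 0.49, 0.03),
    (2, 2, 0.36, 0.26, 0.93),
    (3, 2, 0.83, 0.88, 0.63),
    (4, 2, 0.19, 0.65, 0.99),
    (5, 2, 0.89, 0.91, 0.47),
]
VAR_NAMES = ("var1", "var2", "var3")

# Step 1: make the data totally vertical, one (time, id, value) record per value.
vertical = []
for time, subject, *values in rows:
    for name, value in zip(VAR_NAMES, values):
        vertical.append((time, f"{name}_sub{subject}", value))

# Step 2: "transpose" by time, using the id as the new column name.
wide = {}
for time, col, value in vertical:
    wide.setdefault(time, {})[col] = value

for time in sorted(wide):
    print(time, wide[time])
```

Because the column name is built from the variable name and the subject number, adding more subjects or variables needs no change to the two loops.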

R extension in C, setting matrix row/column names

I'm writing an R package that manipulates Matrices in C. Currently, the matrices returned to R have numbers for the row/column names. I would rather assign my own row/column names when modifying the object in C.
I've googled around for about an hour, but haven't found a good solution yet. The closest I've found is dimnames, but I want to name each column, not just the two dimensions. The matrices get larger than 4x4, below is just a small example of what I want to do.
The number of rows is 4^x, where x is the length of the row name.
Current
[,1] [,2] [,3] [,4]
[1,] 0.20 0.00 0.00 0.80
[2,] 0.25 0.25 0.25 0.25
[3,] 0.25 0.25 0.25 0.25
[4,] 1.00 0.00 0.00 0.00
[5,] 0.20 0.00 0.00 0.80
[6,] 0.25 0.25 0.25 0.25
[7,] 0.25 0.25 0.25 0.25
[8,] 1.00 0.00 0.00 0.00
[9,] 0.20 0.00 0.00 0.80
[10,] 0.25 0.25 0.25 0.25
[11,] 0.25 0.25 0.25 0.25
[12,] 1.00 0.00 0.00 0.00
[13,] 0.20 0.00 0.00 0.80
[14,] 0.25 0.25 0.25 0.25
[15,] 0.25 0.25 0.25 0.25
[16,] 1.00 0.00 0.00 0.00
Desired
[A] [C] [G] [T]
[AA] 0.20 0.00 0.00 0.80
[AC] 0.25 0.25 0.25 0.25
[AG] 0.25 0.25 0.25 0.25
[AT] 1.00 0.00 0.00 0.00
[CA] 0.20 0.00 0.00 0.80
[CC] 0.25 0.25 0.25 0.25
[CG] 0.25 0.25 0.25 0.25
[CT] 1.00 0.00 0.00 0.00
[GA] 0.20 0.00 0.00 0.80
[GC] 0.25 0.25 0.25 0.25
[GG] 0.25 0.25 0.25 0.25
[GT] 1.00 0.00 0.00 0.00
[TA] 0.20 0.00 0.00 0.80
[TC] 0.25 0.25 0.25 0.25
[TG] 0.25 0.25 0.25 0.25
[TT] 1.00 0.00 0.00 0.00
If you are open to C++ instead of C, then Rcpp can make this a little easier. We just create a list object with row and column names as we would in R, and assign that to the dimnames attribute of the matrix object:
R> library(inline) # to compile, link, load the code here
R> src <- '
+ Rcpp::NumericMatrix x(2,2);
+ x.fill(42); // or more interesting values
+ // C++0x can assign a set of values to a vector, but we use older standard
+ Rcpp::CharacterVector rows(2); rows[0] = "aa"; rows[1] = "bb";
+ Rcpp::CharacterVector cols(2); cols[0] = "AA"; cols[1] = "BB";
+ // now create an object "dimnms" as a list with rows and cols
+ Rcpp::List dimnms = Rcpp::List::create(rows, cols);
+ // and assign it
+ x.attr("dimnames") = dimnms;
+ return(x);
+ '
R> fun <- cxxfunction(signature(), body=src, plugin="Rcpp")
R> fun()
AA BB
aa 42 42
bb 42 42
R>
The actual assignment of the column and row names is so manual ... because the current C++ standard does not allow direct assignment of vectors at initialization, but that will change.
Edit: I just realized that I can of course use the static create() method on the row and column names too, which makes this a little easier and shorter still:
R> src <- '
+ Rcpp::NumericMatrix x(2,2);
+ x.fill(42); // or more interesting values
+ Rcpp::List dimnms = // two vec. with static names
+ Rcpp::List::create(Rcpp::CharacterVector::create("cc", "dd"),
+ Rcpp::CharacterVector::create("ee", "ff"));
+ // and assign it
+ x.attr("dimnames") = dimnms;
+ return(x);
+ '
R> fun <- cxxfunction(signature(), body=src, plugin="Rcpp")
R> fun()
ee ff
cc 42 42
dd 42 42
R>
So we are down to three or four statements, no monkeying with PROTECT / UNPROTECT and no memory management.
As Jim said, this is much easier to do in R. I'm passing the names into the C function via the nam argument.
#include <Rinternals.h>
SEXP myMat(SEXP nam) {
    /*PrintValue(nam);*/
    SEXP ans, dimnames;
    PROTECT(ans = allocMatrix(REALSXP, length(nam), length(nam)));
    PROTECT(dimnames = allocVector(VECSXP, 2));
    SET_VECTOR_ELT(dimnames, 0, nam);
    SET_VECTOR_ELT(dimnames, 1, nam);
    setAttrib(ans, R_DimNamesSymbol, dimnames);
    UNPROTECT(2);
    return(ans);
}
If you put that code in a file called myMat.c, you can test it via the line below. I'm using Ubuntu, so you will have to change myMat.so to myMat.dll if you're on Windows.
R CMD SHLIB myMat.c
Rscript -e 'dyn.load("myMat.so"); .Call("myMat", c("A","C","G","T"))'
The note above is instructive. dimnames is a list with as many elements as the matrix has dimensions, where each element is a character vector with one name per element along that dimension, e.g., list(c('a','c','g','t'), c('a','c','g','t')).
To set that in C, I would recommend:
PROTECT(dimnames = allocVector(VECSXP, 2));
PROTECT(rownames = allocVector(STRSXP, 4));
PROTECT(colnames = allocVector(STRSXP, 4));
setAttrib( ? , R_DimNamesSymbol, dimnames);
You'll have to then set the relevant rowname and colname elements. In general, this stuff is much easier to do in R.
jim
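Whichever route is used to attach the names, the names themselves follow a simple pattern: the question's 16 row labels are every length-2 string over A, C, G, T in alphabetical order, which matches the 4^x row count. A small Python sketch of how such labels could be generated before being handed to the C code follows; the helper name `kmer_names` is made up for this illustration and is not part of R or Rcpp.

```python
from itertools import product

def kmer_names(k, alphabet="ACGT"):
    """All length-k strings over `alphabet`, in the order shown in the question."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

row_names = kmer_names(2)     # AA, AC, AG, AT, CA, ...
col_names = list("ACGT")      # A, C, G, T

# Same shape as an R dimnames list: one vector of names per dimension.
dimnames = [row_names, col_names]
print(len(row_names), row_names)
print(col_names)
```

`product` varies the last position fastest, which is why the result comes out as AA, AC, AG, AT, CA and so on, the same order as the desired row labels.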

Why is gmtime implemented this way?

I happened across the source for Minix's gmtime function. I was interested in the bit that calculated the year number from days since epoch. Here are the guts of that bit:
http://www.raspberryginger.com/jbailey/minix/html/gmtime_8c-source.html
http://www.raspberryginger.com/jbailey/minix/html/loc__time_8h-source.html
#define EPOCH_YR 1970
#define LEAPYEAR(year) (!((year) % 4) && (((year) % 100) || !((year) % 400)))
#define YEARSIZE(year) (LEAPYEAR(year) ? 366 : 365)
int year = EPOCH_YR;
while (dayno >= YEARSIZE(year)) {
    dayno -= YEARSIZE(year);
    year++;
}
It looks like the algorithm is O(n), where n is the distance from the epoch. Additionally, it seems that LEAPYEAR must be calculated separately for each year – dozens of times for current dates and many more for dates far in the future. I had the following algorithm for doing the same thing (in this case from the ISO 8601 epoch (Year 0 = 1 BC) rather than the UNIX epoch):
#define CYCLE_1 365
#define CYCLE_4 (CYCLE_1 * 4 + 1)
#define CYCLE_100 (CYCLE_4 * 25 - 1)
#define CYCLE_400 (CYCLE_100 * 4 + 1)
year += 400 * (dayno / CYCLE_400);
dayno = dayno % CYCLE_400;
year += 100 * (dayno / CYCLE_100);
dayno = dayno % CYCLE_100;
year += 4 * (dayno / CYCLE_4);
dayno = dayno % CYCLE_4;
year += 1 * (dayno / CYCLE_1);
dayno = dayno % CYCLE_1;
This runs in O(1) for any date, and looks like it should be faster even for dates reasonably close to 1970.
So, assuming that the Minix developers are Smart People who did it their way for a Reason, and probably know a bit more about C than I do, why?
I ran your code as y2 and the Minix code as y1 on Solaris 9 (v245) and got this profiler data:
%Time  Seconds  Cumsecs   #Calls  msec/call  Name
 79.1     0.34     0.34    36966     0.0092  _write
  7.0     0.03     0.37  1125566     0.0000  .rem
  7.0     0.03     0.40    36966     0.0008  _doprnt
  4.7     0.02     0.42  1817938     0.0000  _mcount
  2.3     0.01     0.43    36966     0.0003  y2
  0.0     0.00     0.43        4     0.      atexit
  0.0     0.00     0.43        1     0.      _exithandle
  0.0     0.00     0.43        1     0.      main
  0.0     0.00     0.43        1     0.      _fpsetsticky
  0.0     0.00     0.43        1     0.      _profil
  0.0     0.00     0.43    36966     0.0000  printf
  0.0     0.00     0.43   147864     0.0000  .div
  0.0     0.00     0.43    73932     0.0000  _ferror_unlocked
  0.0     0.00     0.43    36966     0.0000  memchr
  0.0     0.00     0.43        1     0.      _findbuf
  0.0     0.00     0.43        1     0.      _ioctl
  0.0     0.00     0.43        1     0.      _isatty
  0.0     0.00     0.43    73932     0.0000  _realbufend
  0.0     0.00     0.43    36966     0.0000  _xflsbuf
  0.0     0.00     0.43        1     0.      _setbufend
  0.0     0.00     0.43        1     0.      _setorientation
  0.0     0.00     0.43   137864     0.0000  _memcpy
  0.0     0.00     0.43        3     0.      ___errno
  0.0     0.00     0.43        1     0.      _fstat64
  0.0     0.00     0.43        1     0.      exit
  0.0     0.00     0.43    36966     0.0000  y1
Maybe that is an answer
This is pure speculation, but perhaps MINIX had requirements that were more important than execution speed, such as simplicity, ease of understanding, and conciseness? Some of the code was printed in a textbook, after all.
Your method seems sound, but it's a little more difficult to get it to work for EPOCH_YR = 1970 because you are now mid-cycle on several cycles.
Can you see if you have an equivalent for that case and see whether it's still better?
You're certainly right that it's debatable whether that gmtime() implementation should be used in any high-performance code. That's a lot of busy work to be doing in any tight loops.
Correct approach; you definitely want to go for an O(1) algorithm. It would work for the Mayan calendar without further ado. Check the last line, though: dayno is limited to 0..364, although in leap years it needs to range over 0..365. The line before has a similar flaw.
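One way to handle the flaw noted above is to clamp the two quotients that can overshoot. The Python sketch below does that and checks the result against the standard library. It makes one assumption that differs from the question: day 0 is taken to be 0001-01-01 in the proleptic Gregorian calendar rather than year 0, so that the leap year falls at the end of each 4-year and 400-year cycle. The function name `year_and_day` is made up for this illustration.

```python
from datetime import date

DAYS_1 = 365
DAYS_4 = DAYS_1 * 4 + 1        # 1461: three normal years plus one leap year
DAYS_100 = DAYS_4 * 25 - 1     # 36524: a century whose last year is not leap
DAYS_400 = DAYS_100 * 4 + 1    # 146097: full Gregorian cycle

def year_and_day(dayno):
    """Map a day count (0 = 0001-01-01, assumed epoch) to (year, day of year)."""
    n400, dayno = divmod(dayno, DAYS_400)
    # On the last day of a 400-year cycle the quotient would be 4; clamp to 3.
    n100 = min(dayno // DAYS_100, 3)
    dayno -= n100 * DAYS_100
    # No clamp needed here: the quotient is at most 24.
    n4, dayno = divmod(dayno, DAYS_4)
    # On the last day of a leap year the quotient would be 4; clamp to 3.
    n1 = min(dayno // DAYS_1, 3)
    dayno -= n1 * DAYS_1
    year = 1 + 400 * n400 + 100 * n100 + 4 * n4 + n1
    return year, dayno  # dayno is now 0..364, or 0..365 in a leap year

# Offset of the UNIX epoch in this day numbering, taken from the standard library.
UNIX_EPOCH_DAYNO = date(1970, 1, 1).toordinal() - 1

print(year_and_day(UNIX_EPOCH_DAYNO))
print(year_and_day(UNIX_EPOCH_DAYNO + 365))
```

For a UNIX-style day count, adding `UNIX_EPOCH_DAYNO` first sidesteps the mid-cycle problem raised earlier, at the cost of one extra addition. Whether this is actually faster than the loop for dates near 1970 would still need to be measured.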
