I have a multilabel classification dataset, and I used a KNN model to classify it. There are 15 labels. I computed the accuracy for each label and averaged the results to get the accuracy of the model, which is 93%.
The confusion matrix, however, is showing bad numbers.
What does this mean? Is it overfitting? How can I solve this problem?
Accuracy and mean absolute error (mae) code
Input:
# Getting the accuracy of the model
y_pred1 = level_1_knn_model.predict(X_val1)
accuracy = (sum(y_val1==y_pred1)/y_val1.shape[0])*100
accuracy = sum(accuracy)/len(accuracy)
print("Accuracy: "+str(accuracy)+"%\n")
# Getting the mean absolute error
mae1 = mean_absolute_error(y_val1, y_pred1)
print("Mean Absolute Error: "+str(mae1))
Output:
Accuracy: [96.55462575 97.82146336 99.23207908 95.39247451 98.69340807 74.22793801
78.67975909 97.47825108 99.80189098 77.67264969 91.69399776 99.97084683
99.42621267 99.32682688 99.74159693]%
Accuracy: 93.71426804569977%
Mean Absolute Error: 9.703818402273944
Confusion Matrix and classification report code
Input:
# Calculate the confusion matrix
cMatrix1 = confusion_matrix(y_val1.argmax(axis=1), y_pred1.argmax(axis=1))
# Plot the confusion matrix
plt.figure(figsize=(11,10))
sns.heatmap(cMatrix1, annot=True, fmt='g')
# Calculate the classification report
classReport1 = classification_report(y_val1, y_pred1)
print("\nClassification Report:")
print(classReport1)
Output:
Classification Report:
              precision    recall  f1-score   support

           0       0.08      0.00      0.01      5053
           1       0.03      0.00      0.01      3017
           2       0.00      0.00      0.00      1159
           3       0.07      0.00      0.01      6644
           4       0.00      0.00      0.00      1971
           5       0.58      0.65      0.61     47222
           6       0.39      0.33      0.36     27302
           7       0.02      0.00      0.00      3767
           8       0.00      0.00      0.00       299
           9       0.58      0.61      0.60     40823
          10       0.13      0.02      0.03     11354
          11       0.00      0.00      0.00        44
          12       0.00      0.00      0.00       866
          13       0.00      0.00      0.00      1016
          14       0.00      0.00      0.00       390

   micro avg       0.54      0.43      0.48    150927
   macro avg       0.13      0.11      0.11    150927
weighted avg       0.43      0.43      0.42    150927
 samples avg       0.43      0.43      0.43    150927
I'm currently learning C, and I'm stuck on a textbook problem. The question is: define a function that returns the value x^2 - y^2, and print out an 11 x 11 grid for values of x and y ranging from -1 to 1 using that function. I was able to finish the first row, but I am having trouble with the other rows. The correct output for this problem is
0.00 0.36 0.64 0.84 0.96 1.00 0.96 0.84 0.64 0.36 0.00
-0.36 0.00 0.28 0.48 0.60 0.64 0.60 0.48 0.28 0.00 -0.36
-0.64 -0.28 0.00 0.20 0.32 0.36 0.32 0.20 -0.00 -0.28 -0.64
-0.84 -0.48 -0.20 0.00 0.12 0.16 0.12 -0.00 -0.20 -0.48 -0.84
-0.96 -0.60 -0.32 -0.12 0.00 0.04 -0.00 -0.12 -0.32 -0.60 -0.96
-1.00 -0.64 -0.36 -0.16 -0.04 0.00 -0.04 -0.16 -0.36 -0.64 -1.00
-0.96 -0.60 -0.32 -0.12 0.00 0.04 0.00 -0.12 -0.32 -0.60 -0.96
-0.84 -0.48 -0.20 0.00 0.12 0.16 0.12 0.00 -0.20 -0.48 -0.84
-0.64 -0.28 0.00 0.20 0.32 0.36 0.32 0.20 0.00 -0.28 -0.64
-0.36 0.00 0.28 0.48 0.60 0.64 0.60 0.48 0.28 0.00 -0.36
0.00 0.36 0.64 0.84 0.96 1.00 0.96 0.84 0.64 0.36 0.00
So far in my code I have
double y=1;
int count =0;
double xSq;
double origX = x;
double origY = y;
double ySq;
xSq = x * x;
ySq = y * y;
double update;
for (int i =0; i < 11; i++){
double sum = xSq - ySq;
printf("%f\t", sum);
count++;
y = y - 0.2;
ySq = y * y;
}
I think what they want you to do is different. They want you to first make a function that returns the difference between the squares.
Then, they want you to call that function from a two-level (nested) loop in main, where the two loops vary the values of x and y respectively.
The values of x and y each go from -1 to +1.
Inside the nested loop, you'd call your function with the current values of x and y and get the result. Then you'd print x, y, and the result.
You'll figure out how to print a newline after each pass of the inner loop so that you get your rows.
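For what it's worth, here is a minimal sketch of that structure. It assumes rows correspond to x and columns to y (which is what the expected output in the question implies), and it prints only the result so that the output forms the grid. The names diff_squares, grid_value, and print_grid are made up for illustration. The coordinate is computed from an integer index rather than by repeatedly subtracting 0.2, so rounding error does not build up across the loop:

```c
#include <assert.h>
#include <stdio.h>

/* Returns x^2 - y^2. */
double diff_squares(double x, double y)
{
    return x * x - y * y;
}

/* Maps a grid index 0..10 to a coordinate from -1 to 1 in steps of 0.2. */
double grid_value(int index)
{
    assert(index >= 0 && index <= 10);
    return (index - 5) / 5.0;
}

/* Prints the 11 x 11 grid: one row per x, one column per y. */
void print_grid(FILE *out)
{
    for (int i = 0; i < 11; i++) {          /* outer loop: x, one row each */
        double x = grid_value(i);
        for (int j = 0; j < 11; j++) {      /* inner loop: y, one column each */
            double y = grid_value(j);
            fprintf(out, "%5.2f ", diff_squares(x, y));
        }
        fprintf(out, "\n");                 /* end the row after the inner loop */
    }
}
```

Calling print_grid(stdout) from main would print the whole grid. The exact spacing in the format string is a guess; adjust it to whatever the textbook expects.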
I used gprof to profile a C program that is running too slowly. Here is what I get:
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.05      0.16     0.16                             etext
  0.00      0.16     0.00    90993     0.00     0.00  Nel_wind
  0.00      0.16     0.00    27344     0.00     0.00  calc_crab_dens
  0.00      0.16     0.00    17472     0.00     0.00  Nel_radio
  0.00      0.16     0.00     1786     0.00     0.00  sync
  0.00      0.16     0.00        1     0.00     0.00  _fini
  0.00      0.16     0.00        1     0.00     0.00  calc_ele
  0.00      0.16     0.00        1     0.00     0.00  ic
  0.00      0.16     0.00        1     0.00     0.00  initialize
  0.00      0.16     0.00        1     0.00     0.00  make_table
I don't know what "etext" means or why it accounts for 100.05% of the running time. Thanks for your help!
I was having a similar issue, and it was caused by calling gprof with a different executable from the one that produced the profile data.
It happened because I was recompiling with different options and naively called gprof with the same executable name on two gmon.out files that had been generated by different executables.
gprof exec1 exec1.gmon.out # Good, expected output
gprof exec1 exec2.gmon.out # Weird etext function with 0 calls, but lots of time consumed
Make sure you're not doing something similar.
I have the output below from gprof for my program:
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00    30002     0.00     0.00  insert
  0.00      0.00     0.00    10124     0.00     0.00  getNode
  0.00      0.00     0.00     3000     0.00     0.00  search
  0.00      0.00     0.00        1     0.00     0.00  initialize
I have done some optimizations, and the run time is now 0.01 seconds (measured on the server where I upload my code), which is the lowest I can get at the moment. I would like to reduce it further, but I have not been able to. Does the 0.01 second run time of my program have anything to do with the sampling time shown in the gprof output above?
The call graph is below:
gprof -q ./a.out gmon.out
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) no time propagated
index % time    self  children    called     name
                0.00    0.00   30002/30002       main [10]
[1]      0.0    0.00    0.00   30002         insert [1]
                0.00    0.00   10124/10124       getNode [2]
-----------------------------------------------
                0.00    0.00   10124/10124       insert [1]
[2]      0.0    0.00    0.00   10124         getNode [2]
-----------------------------------------------
                0.00    0.00    3000/3000        main [10]
[3]      0.0    0.00    0.00    3000         search [3]
-----------------------------------------------
                0.00    0.00       1/1           main [10]
[4]      0.0    0.00    0.00       1         initialize [4]
-----------------------------------------------
Using `time /bin/sh -c ' ./a.out < inp.in '` on my machine, I get the output below, which varies slightly on every run.
real 0m0.024s
user 0m0.016s
sys 0m0.004s
real 0m0.017s
user 0m0.008s
sys 0m0.004s
I am a bit confused about how to correlate the time output with the gprof output.
According to your other question, you got it from 8 seconds down to 0.01 seconds.
That's pretty good.
Now if you want to go further, first do as @Peter suggested in his comment.
Run the code many times inside main() so it runs long enough to get samples.
Then you could try my favorite technique.
It will be much more informative than gprof.
P.S. Don't worry about CPU percent.
All it tells is if your machine is busy and not doing much I/O.
It does not tell you anything about your program.
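To make the "run it many times" suggestion concrete, here is a rough sketch. gprof samples every 0.01 seconds, so a program that finishes in about 0.01 seconds can end before a single sample lands, which is why the profile says "no time accumulated". Note that run_workload is a hypothetical stand-in for whatever main() currently does (the insert and search calls in the question), and repeat_workload and the repetition count are assumptions for illustration:

```c
#include <assert.h>

/* Hypothetical stand-in for the real work (e.g. the insert/search calls). */
long run_workload(void)
{
    long sum = 0;
    for (long i = 0; i < 1000; i++)
        sum += i % 7;
    return sum;
}

/* Repeats the workload so the process runs long enough to collect samples.
   The volatile accumulator discourages the compiler from dropping the loop. */
long repeat_workload(long repetitions)
{
    assert(repetitions > 0);
    volatile long sink = 0;
    for (long r = 0; r < repetitions; r++)
        sink += run_workload();
    return sink;
}
```

From main, you would call repeat_workload with a count large enough that the run takes a few seconds, then profile as before. If the workload builds up state (as the insert calls do), reset that state between repetitions so every pass does the same work.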
I'm writing an R package that manipulates Matrices in C. Currently, the matrices returned to R have numbers for the row/column names. I would rather assign my own row/column names when modifying the object in C.
I've googled around for about an hour, but haven't found a good solution yet. The closest I've found is dimnames, but I want to name each column, not just the two dimensions. The matrices get larger than 4x4, below is just a small example of what I want to do.
The number of rows is 4^x, where x is the length of the row name.
Current
[,1] [,2] [,3] [,4]
[1,] 0.20 0.00 0.00 0.80
[2,] 0.25 0.25 0.25 0.25
[3,] 0.25 0.25 0.25 0.25
[4,] 1.00 0.00 0.00 0.00
[5,] 0.20 0.00 0.00 0.80
[6,] 0.25 0.25 0.25 0.25
[7,] 0.25 0.25 0.25 0.25
[8,] 1.00 0.00 0.00 0.00
[9,] 0.20 0.00 0.00 0.80
[10,] 0.25 0.25 0.25 0.25
[11,] 0.25 0.25 0.25 0.25
[12,] 1.00 0.00 0.00 0.00
[13,] 0.20 0.00 0.00 0.80
[14,] 0.25 0.25 0.25 0.25
[15,] 0.25 0.25 0.25 0.25
[16,] 1.00 0.00 0.00 0.00
Desired
[A] [C] [G] [T]
[AA] 0.20 0.00 0.00 0.80
[AC] 0.25 0.25 0.25 0.25
[AG] 0.25 0.25 0.25 0.25
[AT] 1.00 0.00 0.00 0.00
[CA] 0.20 0.00 0.00 0.80
[CC] 0.25 0.25 0.25 0.25
[CG] 0.25 0.25 0.25 0.25
[CT] 1.00 0.00 0.00 0.00
[GA] 0.20 0.00 0.00 0.80
[GC] 0.25 0.25 0.25 0.25
[GG] 0.25 0.25 0.25 0.25
[GT] 1.00 0.00 0.00 0.00
[TA] 0.20 0.00 0.00 0.80
[TC] 0.25 0.25 0.25 0.25
[TG] 0.25 0.25 0.25 0.25
[TT] 1.00 0.00 0.00 0.00
If you are open to C++ instead of C, then Rcpp can make this a little easier. We just create a list object with rows and column names as we would in R, and assign that to the dimnames attribute of the matrix object:
R> library(inline) # to compile, link, load the code here
R> src <- '
+ Rcpp::NumericMatrix x(2,2);
+ x.fill(42); // or more interesting values
+ // C++0x can assign a set of values to a vector, but we use older standard
+ Rcpp::CharacterVector rows(2); rows[0] = "aa"; rows[1] = "bb";
+ Rcpp::CharacterVector cols(2); cols[0] = "AA"; cols[1] = "BB";
+ // now create an object "dimnms" as a list with rows and cols
+ Rcpp::List dimnms = Rcpp::List::create(rows, cols);
+ // and assign it
+ x.attr("dimnames") = dimnms;
+ return(x);
+ '
R> fun <- cxxfunction(signature(), body=src, plugin="Rcpp")
R> fun()
AA BB
aa 42 42
bb 42 42
R>
The actual assignment of the column and row names is so manual ... because the current C++ standard does not allow direct assignment of vectors at initialization, but that will change.
Edit: I just realized that I can of course use the static create() method for the row and column names too, which makes this a little easier and shorter still:
R> src <- '
+ Rcpp::NumericMatrix x(2,2);
+ x.fill(42); // or more interesting values
+ Rcpp::List dimnms = // two vec. with static names
+ Rcpp::List::create(Rcpp::CharacterVector::create("cc", "dd"),
+ Rcpp::CharacterVector::create("ee", "ff"));
+ // and assign it
+ x.attr("dimnames") = dimnms;
+ return(x);
+ '
R> fun <- cxxfunction(signature(), body=src, plugin="Rcpp")
R> fun()
ee ff
cc 42 42
dd 42 42
R>
So we are down to three or four statements, no monkeying with PROTECT / UNPROTECT and no memory management.
As Jim said, this is much easier to do in R. I'm passing the names into the C function via the nam argument.
#include <Rinternals.h>
SEXP myMat(SEXP nam) {
/*PrintValue(nam);*/
SEXP ans, dimnames;
PROTECT(ans = allocMatrix(REALSXP, length(nam), length(nam)));
PROTECT(dimnames = allocVector(VECSXP, 2));
SET_VECTOR_ELT(dimnames, 0, nam);
SET_VECTOR_ELT(dimnames, 1, nam);
setAttrib(ans, R_DimNamesSymbol, dimnames);
UNPROTECT(2);
return(ans);
}
If you put that code in a file called myMat.c, you can test it via the line below. I'm using Ubuntu, so you will have to change myMat.so to myMat.dll if you're on Windows.
R CMD SHLIB myMat.c
Rscript -e 'dyn.load("myMat.so"); .Call("myMat", c("A","C","G","T"))'
The note above is instructive. The dimnames attribute is a list with one element per dimension of the object, where each element is a character vector holding one name per index along that dimension, i.e., list(c('a','c','g','t'), c('a','c','g','t')).
To set that in C, I would recommend:
SEXP dimnames, rownames, colnames;
PROTECT(dimnames = allocVector(VECSXP, 2));
PROTECT(rownames = allocVector(STRSXP, 4));
PROTECT(colnames = allocVector(STRSXP, 4));
/* fill each name, e.g. SET_STRING_ELT(rownames, 0, mkChar("A")); */
SET_VECTOR_ELT(dimnames, 0, rownames);
SET_VECTOR_ELT(dimnames, 1, colnames);
setAttrib( ? , R_DimNamesSymbol, dimnames);
You'll then have to set the relevant rowname and colname elements, and call a matching UNPROTECT(3) once you are done. In general, this stuff is much easier to do in R.
jim