Natural language classifier returns classifications for untrained items - ibm-watson

I am confused about how NLC works. My expectation is that when it is asked to classify text that has no relation to its training data, it should return no results, or results with very low confidence scores.
I have trained a model with a set of training data, and when I attempt to classify text that is outside of the training data I get results with high confidence values (~60%).
Here's an example of my training data:
foo,1,2,3,4
bar,1,2,3,4
baz,1,2,3,4
When I try to classify the text "This should not exist" I receive a high confidence that this text is "1".
Is my assumption correct that I should not be returned values in this case? Am I training the classifier on foo, bar, and baz incorrectly? If not, what should I expect from the NLC service?

Imagine that you have 3 buckets and you have to throw a coin into one of them. With no other information, each bucket has a 33.3% chance of getting the coin. The same happens with the Natural Language Classifier service: it is trained to classify input text into predefined classes.
If you create a classifier with 3 classes and you try to classify text that wasn't in the training data, NLC will still assign your sentence to one of the three classes you defined. If the top class gets 60%, the other two buckets share the remaining 40%.
Sometimes you could get a high score, and that's normal when you have classes that are very different.
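The bucket behaviour can be illustrated with a small sketch. It assumes the classifier normalizes raw class scores with a softmax-style function, which is an assumption for illustration and not a statement about NLC's internals; the scores are made up.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into confidences that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores for an out-of-domain sentence over three classes.
scores = [1.2, 0.1, -0.3]
probs = softmax(scores)

# The confidences always add up to 1, so some class has to win,
# even if the input resembles none of the training examples.
best = int(np.argmax(probs))
```

Because there is no "none of the above" bucket, a closed-set classifier like this cannot express "I have never seen anything like this"; a threshold on the top confidence, or an explicit catch-all class in the training data, would have to be added by the caller.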

Related

Inconsistent # of Samples NLP IMDB Sentiment Classification

I'm trying to make a sentiment classifier on the IMDB dataset. I am fairly new to NLP and data science, and I keep getting this error while trying to fit my model.
ValueError: Found input variables with inconsistent numbers of samples: [24745, 40000]
I've looked at many other threads and they all say to reshape the data, which I did. The shape of my X variable is (24745, 100) and the shape of my y variable is (40000, 1).
I am currently trying to use any of these models:
MultinomialNB
BernoulliNB
GaussianNB
I've also tried to create a TensorFlow Sequential model with a Bidirectional LSTM, but that produced terrible accuracy: it stayed at 50% for a long time, which is no better than random guessing.
Any help appreciated please.
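The error says that X and y have different numbers of rows, which reshaping cannot fix. One common way to end up there is dropping rows from the features without dropping the same rows from the labels. The sketch below assumes that is what happened (the post does not show the preprocessing); the arrays and the filter are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-ins for the reviews and their sentiment labels.
texts = np.array(["great film", "", "terrible plot", "", "loved it"])
labels = np.array([1, 0, 0, 1, 1])

# Build ONE boolean mask and apply it to both arrays,
# so every remaining review keeps its own label.
keep = np.array([len(t.strip()) > 0 for t in texts])
X_texts = texts[keep]
y = labels[keep]

# Cheap guard to run right before model.fit(X, y).
assert len(X_texts) == len(y), (len(X_texts), len(y))
```

Checking the two lengths right after each preprocessing step usually shows where they start to diverge.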

In Watson Discovery API, which result shall I use to determine the most relevant documents: score or confidence?

I work on a Discovery collection that I have never trained.
When I launch a natural language query on my collection, I see 2 notions in the result_metadata of the retrieved documents: score and confidence
ex:
"confidence": 0.0847209066468392,
"score": 3.4830062
and the tag "retrieval_details" has the value "document_retrieval_strategy": "untrained"
In the documentation, it is first written that "The confidence score will be returned for both trained and untrained private collections" and further that "The confidence score for a result with the document_retrieval_strategy of untrained is an unsupervised estimate of how relevant the document results are to the query; it is not interchangeable with the the score returned for trained collections. A trained collection can provide better answers to natural language queries than untrained collections."
Precisely: what does that mean? How is that confidence score calculated? Which result shall I use to get the most relevant documents: score or confidence?
You need to use confidence. Score should never be used to define thresholds, since it’s a relative calculation.
It’s also recommended to use the “document_retrieval_strategy” as part of the thresholds, with a different threshold for each strategy, or at least one for trained and one for untrained, since the way confidence is treated differs depending on the strategy applied.
This post can give you some ideas on how to define your threshold.
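A minimal sketch of per-strategy thresholds on confidence. It works on a plain dict shaped like the fields quoted in the question (result_metadata.confidence per result, retrieval_details.document_retrieval_strategy for the query); the exact response layout, the strategy name "relevancy_training", and all threshold values are assumptions to verify and tune against your own collection.

```python
# Hypothetical thresholds, one per retrieval strategy.
THRESHOLDS = {"untrained": 0.05, "relevancy_training": 0.5}
DEFAULT_THRESHOLD = 0.5  # used when the strategy is missing or unknown

def filter_by_confidence(response):
    """Keep only results whose confidence meets the strategy's threshold."""
    details = response.get("retrieval_details", {})
    strategy = details.get("document_retrieval_strategy")
    threshold = THRESHOLDS.get(strategy, DEFAULT_THRESHOLD)
    return [
        result
        for result in response.get("results", [])
        if result.get("result_metadata", {}).get("confidence", 0.0) >= threshold
    ]

# Hypothetical response reusing the numbers from the question.
response = {
    "retrieval_details": {"document_retrieval_strategy": "untrained"},
    "results": [
        {"id": "a", "result_metadata": {"confidence": 0.0847209066468392, "score": 3.4830062}},
        {"id": "b", "result_metadata": {"confidence": 0.01, "score": 1.2}},
    ],
}
kept = filter_by_confidence(response)
```

Note that score is never consulted: it is only meaningful for ordering results within a single query.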

A classifier trained using word embeddings (doc2vec) and logistic regression misclassifies the data

I have a text classification problem in which the data set consists of 16 million records. The data is highly imbalanced and consists of almost 500 classes. I used doc2vec embeddings and then built a Logistic Regression model whose input is the doc2vec matrix. I achieved 88% accuracy, with recall and F1-score close to 84%, but the classifier doesn't perform well on test data. For example,
if the text is something like "I love traveling", it is classified under the Travel tag, but if the model encounters the text again it classifies it into a different category, such as unknown. This is unusual behavior that I have encountered in the model.
Code:
Y = " Love travelling"
tokens = cleanText(Y)
vector = model.infer_vector(tokens.split())
predict = logreg.predict(vector.reshape(1,-1))
print(predict)
Output Expected:
Travel
Actual Output:
Unmapped
Regressor code:
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)  # renamed so the class_weight module is not shadowed
logreg = LogisticRegression(multi_class='multinomial', solver='lbfgs', class_weight="balanced")
logreg.fit(train_vectors_dbow, y_train)
Expected result:
Text Label
1. I like to travel Travel
2. I love to travel Travel
Actual
Text Label
1. I like to travel Travel
2. I love to travel Unmapped

Motivation for k-medoids

Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that
the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives
of a cluster from the data rather than from R^n?
The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to give the reason why it produced a certain output.
Let's take an example.
Suppose there is a food review dataset in which every review is either positive or negative, so we take k = 2, where k is the number of clusters. In k-means, the update step recomputes each of the k centroids as the mean of the points assigned to that cluster. Since this is a text problem, you would first turn each review into a feature vector with a scheme such as bag of words (BoW) or word2vec. The centroid c_i that k-means produces is then the mean of the vectors in that cluster, and from that centroid you can interpret very little, if anything.
If you apply k-medoids to the same problem, you choose the k medoids from the dataset itself. Say you choose the point x_5 from your dataset as the first medoid. Interpretability increases because the cluster center is now an actual review. So in k-medoids the cluster centers are always members of your dataset.
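A minimal sketch of that difference for a single cluster, assuming Euclidean distance and made-up 2-D feature vectors standing in for the BoW or word2vec vectors of three reviews (both the reviews and the vectors are hypothetical):

```python
import numpy as np

# Hypothetical reviews assigned to one cluster, with made-up feature vectors.
reviews = ["tasty and fresh", "really tasty", "fresh, tasty, great"]
vectors = np.array([[1.0, 0.0], [0.8, 0.4], [0.9, 0.1]])

# k-means style center: the mean vector, which matches no actual review.
centroid = vectors.mean(axis=0)

# k-medoids style center: the member with the smallest total distance
# to all other members of the cluster.
diffs = vectors[:, None, :] - vectors[None, :, :]
pairwise = np.linalg.norm(diffs, axis=2)
medoid_index = int(np.argmin(pairwise.sum(axis=1)))
medoid_review = reviews[medoid_index]
```

The centroid is just a pair of numbers, while medoid_review is a sentence you can read to see what the cluster is about.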
This is the foremost motivation for introducing k-medoids.
As for metrics, you can apply every metric that you would apply with k-means.
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
For relatively small data sets (k-medoids has greater computational complexity), k-medoids gives a clustering that is more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=32)
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0],points[:,1],"g.", markersize=8, alpha=0.3) # green points
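The robustness claim can also be checked numerically on a single cluster. This is a minimal sketch with made-up 1-D data, not the experiment from the plot: it only compares how the two kinds of center react to one outlier.

```python
import numpy as np

# Hypothetical cluster: four nearby points and one far outlier.
cluster = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# k-means style center: the mean is dragged towards the outlier.
mean_center = cluster.mean()

# k-medoids style center: the member with the smallest total
# distance to all other members stays inside the dense group.
pairwise = np.abs(cluster[:, None] - cluster[None, :])
medoid = cluster[np.argmin(pairwise.sum(axis=1))]
```

Here the mean lands far away from every regular point, while the medoid remains one of them.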
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative weren't an actual person, we wouldn't have a history of actually watched movies to recommend from. Each time we'd have to do an additional search, e.g. for the closest person in the cluster. Example data: classic MovieLens 1M Dataset
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroid (cluster center) is calculated as the average of the vectors in the cluster, while in k-medoids the medoid (cluster center) is the record from the dataset with the smallest total distance to the other records in the cluster. So if you need the cluster center to be an actual record of your data, use k-medoids; otherwise use k-means (the concept behind both algorithms is the same).
The K-Means algorithm uses a distance function such as Euclidean distance or Manhattan distance, which is computed over vector-based instances. The K-Medoids algorithm instead uses a more general (and less constrained) distance function: any pairwise distance function.
This distinction matters in contexts like complex data types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and k-means implementations, the distance computation phase can spend a lot of time scanning the entire attribute vector of each instance. Take document clustering with the standard TF-IDF representation: when computing cosine similarity, the distance function scans every word that appears in the whole collection of documents, which in many cases amounts to millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of the language.
Using K-Medoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation, each document can be stored as the set of words appearing at least twice in it, with Jaccard distance as the distance measure.
By contrast, a vector representation is as long as the number of words in your dictionary.
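A minimal sketch of that set-based alternative: documents are plain Python sets of words, Jaccard distance is computed directly on the sets, and the medoid is the document with the smallest total distance to the rest. The word sets are hypothetical, and the helper names are made up for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def medoid_index(items, distance):
    """Index of the item with the smallest total distance to all items."""
    totals = [sum(distance(x, y) for y in items) for x in items]
    return totals.index(min(totals))

# Hypothetical documents, each reduced to a set of its repeated words.
docs = [
    {"cluster", "distance", "medoid"},
    {"cluster", "distance", "centroid"},
    {"cluster", "medoid", "distance", "outlier"},
    {"price", "house", "network"},
]

center = medoid_index(docs, jaccard_distance)
```

No dictionary-sized vector is ever built: each comparison only touches the words present in the two documents involved.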
Heterogeneity and Complex Data Types.
There are many domains where it is considerably better to abstract the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad-hoc distance measures that fit the data domain better.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

House pricing using neural network

I wrote a multilayer perceptron implementation (in Python) which is able to classify the Iris dataset. It was trained with the backpropagation algorithm and uses sigmoid activation functions on the hidden and output layers.
But now I want to change it to be able to approximate house price.
(I have dataset of ~300 estates with prices and input parameters like rooms, location etc.)
Now output of my perceptron is in range [0;1]. But as far as I understand if I want to get resulting house price on the output neuron I need to change that activation function somehow right?
Can somebody help me?
I'm new to neural networks
Thanks in advance.
Assuming, for instance, that house prices range between $1 and $1,000,000, you can just map the 0...1 range to the final price range, both for training and for testing. Just note that 300 estates is a fairly small data set.
To be precise, if a house is $500k, then the target training output becomes 0.5. You can basically divide by your maximum possible home value to get the target training amount. When you get the output value, you multiply by the maximum home value to get the predicted price.
So, view the output of the neural network as the percentage of the total cost.
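The mapping can be sketched in a few lines. MAX_PRICE and both helper names are hypothetical; the only assumption is that no house in the data costs more than MAX_PRICE.

```python
MAX_PRICE = 1_000_000  # assumed upper bound on any house price

def price_to_target(price):
    """Scale a price into [0, 1] so a sigmoid output neuron can fit it."""
    return price / MAX_PRICE

def output_to_price(output):
    """Map the network's output in [0, 1] back to a price."""
    return output * MAX_PRICE

# Training: convert every price to a target before backpropagation.
targets = [price_to_target(p) for p in (250_000, 500_000, 900_000)]

# Prediction: convert the raw network output back to dollars.
predicted = output_to_price(0.5)
```

The same MAX_PRICE has to be used for training and for prediction, otherwise the predicted prices are off by a constant factor.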
