In Watson Discovery API, Which result shall i use to determine the most relevant documents: score or confidence? - ibm-watson

I work on a Discovery collection on which i never made a training.
When i launch a natural language query on my collection, in the result_metadata of the retrieved documents, i see 2 notions: score and confidence
ex:
"confidence": 0.0847209066468392,
"score": 3.4830062
and the tag "retrieval_details" has the value "document_retrieval_strategy": "untrained"
In the documentation, it is first written that "The confidence score will be returned for both trained and untrained private collections" and further that "The confidence score for a result with the document_retrieval_strategy of untrained is an unsupervised estimate of how relevant the document results are to the query; it is not interchangeable with the the score returned for trained collections. A trained collection can provide better answers to natural language queries than untrained collections."
Precisely: what does that mean ? How that confidence score is calculated ? Which result shall i use to get the most relevant documents : score or confidence ?

You need to use confidence. Score should never be used to define thresholds, since it’s a relative calculation.
It’s also recommended to use the “document_retrieval_strategy” as part of the thresholds, having a different threshold to each strategy, or at least one for trained and one for untrained, since the way confidence is treated will be different based on the strategy applied.
This post can give you some ideas on how to define your threshold.

Related

Natural language classifier returns classifications for untrained items

I am confused as to how NLC works. My expectation is that when it is asked to classify text that it should have no relation or training data to learn from it should return no results or results with very low confidence scores.
I have trained a model with a set of training data and when I attempt to classify text that is outside of the training data I am getting results with high confidence values (~60%).
Here's an example of my training data:
foo,1,2,3,4
bar,1,2,3,4
baz,1,2,3,4
When I try to classify the text "This should not exist" I receive a high confidence that this text is "1".
Is my assumption correct in that I should be returned values in this case? Am I training the data to classify foo, bar, and baz incorrectly? If not what should I expect from the NLC service?
Imagine that you have 3 buckets and you have to throw a coin in one of them. Each bucket has 33.3 % changes to get the coin. The same happens with Natural Language Classifier service. It is trained to classify input text into predefined classes.
If you create a classifier with 3 classes and you try to classify text that wasn't in the training data, NLC will still classify your sentence to one of the three classes you defined. If your output is 60% then the other two buckets will get the remaining 40%.
Sometimes you could get a high score and that's normal when you have classes that are very different.

LDA: Why sampling for inference of a new document?

Given a standard LDA model with few 1000 topics and few millions of documents, trained with Mallet / collapsed Gibbs sampler:
When inferring a new document: Why not just skip sampling and simply use the term-topic counts of the model to determine the topic assignments of the new document? I understand that applying the Gibbs sampling on the new document is taking into account the topic mixture of the new document which in turn influence how topics are composed (beta, term-freq. distributions). However as topics are kept fixed when inferring a new document, i don't see why this should be relevant.
An issue with sampling is the probabilistic nature - sometimes documents topic assignments inferred, greatly vary on repeated invocations. Therefore i would like to understand the theoretical and practical value of the sampling vs. just using a deterministic method.
Thanks Ben
Just using term topic counts of the last Gibbs sample is not a good idea. Such an approach doesn't take into account the topic structure: if a document has many words from one topic, it's likely to have even more words from that topic [1].
For example, say two words have equal probabilities in two topics. The topic assignment of the first word in a given document affects the topic probability of the other word: the other word is more likely to be in the same topic as the first one. The relation works the other way also. The complexity of this situation is why we use methods like Gibbs sampling to estimate values for this sort of problem.
As for your comment on topic assignments varying, that can't be helped, and could be taken as a good thing: if a words topic assignment varies, you can't rely on it. What you're seeing is that the posterior distribution over topics for that word has no clear winner, so you should take a particular assignment with a grain of salt :)
[1] assuming beta, the prior on document-topic distributions, encourages sparsity, as is usually chosen for topic models.
The real issue is computational complexity. If each of N tokens in a document can have K possible topics, there are K to the N possible configurations of topics. With two topics and a document the size of this answer, you have more possibilities than the number of atoms in the universe.
Sampling from this search space is, however, quite efficient, and usually gives consistent results if you average over three to five consecutive Gibbs sweeps. You get to do something computationally impossible, and what it costs you is some uncertainty.
As was noted, you can get a "deterministic" result by setting a fixed random seed, but that doesn't actually solve anything.

Motivation for k-medoids

Why would one use kmedoids algoirthm rather then kmeans? Is it only the fact that
the number of metrics that can be used in kmeans is very limited or is there something more?
Is there an example of data, for which it makes much more sense to choose the best representatives
of cluster from the data rather then from R^n?
The problem with k-means is that it is not interpretable. By interpretability i mean the model should also be able to output the reason that why it has resulted a certain output.
lets take an example.
Suppose there is food review dataset which has two posibility that there is a +ve review or a -ve review so we can say we will have k= 2 where k is the number of clusters. Now if you go with k-means where in the algorithm the third step is updation step where you update your k-centroids based on the mean distance of the points that lie in a particular cluster. The example that we have chosen is text problem, so you would also apply some kind of text-featured vector schemes like BagOfWords(BOW), word2vec. now for every review you would get the corresponding vector. Now the generated centroid c_i that you will get after running the k-means would be the mean of the vectors present in that cluster. Now with that centroid you cannot interpret much or rather i should say nothing.
But for same problem you apply k-medoids wherein you choose your k-centroids/medoids from your dataset itself. lets say you choose x_5 point from your dataset as first medoid. From this your interpretability will increase beacuse now you have the review itself which is termed as medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation of introducing k-mediods
Coming to the metrics part you can apply all the metrics that you apply for k-means
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
In case of relatively small data sets (as k-medoids complexity is greater) - to obtain a clustering more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=32)
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0],points[:,1],"g.", markersize=8, alpha=0.3) # green points
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend top movies that the cluster representant (actual person) watched. If the cluster representant wasn't an actual person we wouldn't possess the history of actually watched movies to recommend. Each time we'd have to search additionally e.g. for the closest person from the cluster. Example data: classic MovieLens 1M Dataset
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, cluster representants are invited to the drug trial.
Difference between is that in k-means centroids(cluster centrum) are calculated as average of vectors containing in the cluster, and in k-medoids the medoid (cluster centrum) is record from dataset closest to centroid, so if you need to represent cluster centrum by record of your data you use k-medoids, otherwise i should use k-means (but concept of these algorithms are same)
The K-Means algorithm uses a Distance Function such as Euclidean Distance or Manhattan Distance, which are computed over vector-based instances. The K-Medoid algorithm instead uses a more general (and less constrained) distance function: aka pair-wise distance function.
This distinction works well in contexts like Complex Data Types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithms, the distance computation phase can spend a lot of time scanning the entire vector of attributes that belongs to an instance; for instance, in the context of documents clustering, using the standard TF-IDF representation. During the computation of the cosine similarity, the distance function scans all the possible words that appear in the whole collection of documents. Which in many cases can be composed by millions of entries. This is why, in this domain, some authors [1] suggests to restrict the words considered to a subset of N most frequent word of that language.
Using K-Kedoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents is possible to use the set of words appearing at least twice in the document; and as a distance measure, there can be used Jaccard Distance.
In this case, vector representation is long as the number of words in your dictionary.
Heterogeneousity and Complex Data Types.
There are many domains where is considerably better to abstract the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data type allows the design of ad-hoc distance measures which can fit better with the proper data domain.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

Explanation of solr max score

When I choose to view the score field in solr results I see the score assigned by solr to every document returned and a maxscore value that is the score of the topmost returned document.
I need to know is there a cut-off to the solr score or not. I mean if the maxscore is 6.89343 or 2.34365, so does this mean that it is 6.89343 of 10 as the final score? or how can I decide that I'm close to the most correct result.
If possible, I need a simple explanation of the scoring algorithm used by solr.
The maxscore is the scoring of the topmost document in the search results.
There is no cutoff for the maxscore, and depends upon the scoring calculations and normalization done by Lucene/Solr.
The topmost document would have the maxscore, while you would get an idea from the scores of the documents below it, as to how off they are from the topmost.
For Scoring explaination you can check link
If it is indeed a z-score from a normal distribution then you can calculate the CDF (as it appears here ). The CDF will give you a bounded score from 0 to 1. Its hard for me to interpret what the CDF really means in this case given the un-normalized score is calculated in several steps, but you can sort of think of it as the probability that you got the right answer as long as your collection is well populated with the relevant material.

Query Term elimination

In boolean retrieval model query consist of terms which are combined together using different operators. Conjunction is most obvious choice at first glance, but when query length growth bad things happened. Recall dropped significantly when using conjunction and precision dropped when using disjunction (for example, stanford OR university).
As for now we use conjunction is our search system (and boolean retrieval model). And we have a problem if user enter some very rare word or long sequence of word. For example, if user enters toyota corolla 4wd automatic 1995, we probably doesn't have one. But if we delete at least one word from a query, we have such documents. As far as I understand in Vector Space Model this problem solved automatically. We does not filter documents on the fact of term presence, we rank documents using presence of terms.
So I'm interested in more advanced ways of combining terms in boolean retrieval model and methods of rare term elimination in boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector where the wi are: 0 if the ith search term doesn't appear in the file, 1 if it does; the number of times search term i appears in the file; etc. Then, rank pages based on e.g. Manhattan distance, Euclidean distance, etc. and sort in descending order, possibly culling results with distance below a specified match tolerance.
If you want to handle more complex queries, you can put the query into CNF - e.g. (term1 or term2 or ... termn) AND (item1 or item2 or ... itemk) AND ... and then redefine the weights wi accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...

Resources