I have written a small search engine as my weekly project. It is based on cosine similarity between the query vector and the document vectors, where each vector is calculated from the tf-idf scores of its tokens.
I have come to know about Apache Solr, which is a full-text search engine. My question is: does Solr use cosine similarity internally when ranking search results?
No. Solr uses something similar to cosine similarity, but not quite the same - there are some key differences.
If you visit that same link (https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) and scroll further down, you will see "Lucene Conceptual Scoring Formula" and "Lucene Practical Scoring Formula" that give more details.
Ignoring any index/query-time boosts, the following are some key differences:
1. Different document normalization factor
Instead of normalizing each document by the Euclidean norm of its tf-idf vector, it uses "doc-len-norm". For the default similarity measure (DefaultSimilarity) this is just 1/sqrt(number of terms in the doc), which basically equals 1/sqrt(sum(tf)) - where sum(tf) is the sum of the term counts in the doc - with no squaring as in the Euclidean norm, and with the idf of each term left out. Furthermore, this value is rounded to a byte to save space. This will most often come out to a different value than the normalization factor used for cosine similarity.
2. Extra "coord" boost
There is also an extra value multiplied onto the score equal to:
the number of query terms matched in the document / the total number of terms in the query.
This gives an extra boost to fields (documents) matching more of the query terms, and may be of questionable value. It essentially multiplies the tf-idf score by another inner product: the inner product of the two vectors converted to boolean vectors (0 if a term is absent, 1 if it is present), with the query vector normalized only by its Euclidean norm.
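To make the contrast concrete, here is a rough Python sketch of the two schemes described above. It is only an illustration of the differences (it ignores query norms, boosts and the byte rounding), not the actual Lucene code, and it assumes the tf-idf weight vectors have already been computed:

import numpy as np

def cosine_score(q, d):
    # classic cosine similarity between tf-idf vectors
    return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

def lucene_like_score(q, d, doc_term_count):
    # doc-len-norm from DefaultSimilarity: 1/sqrt(number of terms in the doc)
    doc_len_norm = 1.0 / np.sqrt(doc_term_count)
    # coord: fraction of query terms that actually occur in the document
    coord = np.count_nonzero((q > 0) & (d > 0)) / np.count_nonzero(q > 0)
    # dot product of tf-idf weights, scaled by doc-len-norm and coord
    return np.dot(q, d) * doc_len_norm * coord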
Yes, Solr (which runs on top of Lucene) does use Cosine similarity. From the Lucene documentation:
VSM score of document d for query q is the Cosine Similarity of the
weighted query vectors V(q) and V(d)
cosine-similarity(q,d) = V(q) · V(d) / |V(q)| |V(d)|
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
If you're looking for actual vector similarity in Solr, there are two approaches:
1) use delimited payloads. There are a few plugins that implement this already, like https://github.com/moshebla/solr-vector-scoring and https://github.com/saaay71/solr-vector-scoring
2) use streaming expressions, which comes out of the box: https://lucene.apache.org/solr/guide/8_5/vector-math.html
The latter is slower, but more flexible.
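As a rough illustration of the streaming-expression route, the vector math guide linked above includes a cosineSimilarity function, so an expression along these lines should work (the vectors here are made up, and the exact syntax should be verified against the guide for your Solr version):

let(a=array(1.0, 2.0, 3.0),
    b=array(4.0, 5.0, 6.0),
    sim=cosineSimilarity(a, b))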
My SQL server contains 2 tables containing a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (can't change that). The set of fields in the tables are similar though not identical - most exist in both tables, some only in one, some the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.
Example fields:
Street Number
Predirection
Street Name
Street Suffix
Postdirection (one table and not the other)
Unit name (one table) v Address 2 (other table) --adds complexity
Zip code (length varies in each table 5 v 5+ digits)
Legal description
Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source is highly preferred by the client.
Among options such as SOUNDEX, DIFFERENCE, Levenshtein distance (all SQL) and usaddress, dedupe (Python) none stand out as a good-fit solution.
Ideally I'd like a simple way to call a "function" which returns
either a boolean or a confidence level of match (0.0 - 1.0).
A similarity metric is what you're looking for. You can use distance metrics to calculate similarity; the Levenshtein distance, Damerau-Levenshtein distance and Hamming distance are examples of distance metrics.
Given the two strings, let M be the length of the shorter one, N the length of the longer one, and D your distance metric; you can then measure string similarity as (M-D)/N. You can also use the Longest Common Subsequence or Longest Common Substring (LCS) to measure similarity by dividing LCS/N.
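As a minimal Python sketch of that (M-D)/N idea, using a plain dynamic-programming Levenshtein distance (illustrative only, not a tuned implementation):

def levenshtein(s, t):
    # classic dynamic-programming edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(s, t):
    # (M - D) / N: M = shorter length, N = longer length, D = distance
    m, n = sorted((len(s), len(t)))
    return (m - levenshtein(s, t)) / n

print(similarity("123 Main St", "123 Main Street"))  # ~0.47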
If you can use CLRs, I HIGHLY recommend mdq.similarity, which you can get from here. It will give a similarity metric using these algorithms:
The Damerau-Levenshtein distance (the documentation only says "Levenshtein", but they are mistaken)
The Jaccard similarity coefficient algorithm
A form of the Jaro-Winkler distance algorithm
A longest common subsequence algorithm (which grows by one when transpositions are involved)
If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.). Ditto my N-Grams function.
You can easily create your own metric using NGrams8K.
For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).
I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).
I'm loading data into Solr from a MySQL database with DataImportHandler. Every document contains a popularity field (int type) that is calculated by another application and saved into MySQL (this field is based on some rules related to the application domain).
How can I use this value to improve Solr ranking? Would it be correct to sum the Solr score with the popularity value?
How can bf be used here?
A good starting point that'd probably work is multiplying the score by a sublinear function that increases (slowly) with popularity. For example,
newScore = score * log(1 + 0.5 * popularity)
To apply this boost you should use Solr's EDisMax query parser and pass the boost parameter with the following value:
&boost=log(sum(1, product(0.5, popularity)))
where popularity is the name of the field. You don't need to use the bf parameter since you should use a multiplicative boost, not an additive one.
The reason for adding 1 is to handle the case in which popularity=0 (so if each document's popularity is always at least 1, you don't need to add 1). The strength of the popularity effect can be increased or decreased by changing the 0.5 factor to some other value. For example, you can use a factor of 2 to increase the effect:
newScore = score * log(1 + 2 * popularity)
A good factor is probably around 9 / m, where m is what you expect the median popularity to be, since in this case the boost of a "median document" (median in the sense that its popularity equals m) is going to be log(1 + (9/m)*m) = log(10) = 1 - that is, its score won't be boosted at all (Solr's log function is base 10).
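To get a feel for the shape of this boost, here is a small Python sketch that evaluates the multiplier for a few popularity values and factors (the numbers are made up, purely for illustration):

import math

def boost(popularity, factor):
    # mirrors log(sum(1, product(factor, popularity))) - Solr's log is base 10
    return math.log10(1 + factor * popularity)

for factor in (0.5, 2, 9 / 20):  # 9/20 assumes a median popularity of m = 20
    print(factor, [round(boost(p, factor), 2) for p in (1, 10, 20, 100)])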
Again, this is just a starting point and you'll have to try out different boosting functions until you find one that performs well.
I'm trying to make a piece of software that compares two text documents intelligently - something like checking how much the text matches, not like DIFF.
I have searched quite a bit on Google, and I found two things: graphs and TF-IDF.
But I'm confused between the two; I don't know which one is better, and also, is there any other technique to match text documents?
Did you look at measuring document similarity by cosine distance?
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them http://en.wikipedia.org/wiki/Cosine_similarity
If you have documents A and B, you can create a term vector for each. The term vector for A would contain the words from document A and each word's frequency in the document. Instead of raw word frequency you can use TF-IDF weighting. The same goes for document B. Once you have term vectors A and B, you can calculate the cosine similarity between them, which represents the similarity of documents A and B.
Before creating term vectors you do some pre-processing tasks like filtering stop-words.
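As a concrete illustration, here is a small Python sketch using scikit-learn (just one possible library choice; any TF-IDF implementation would do) that builds TF-IDF term vectors for two documents and computes their cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "A quick brown dog jumps over a lazy fox."

# stop-word filtering as part of the pre-processing mentioned above
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([doc_a, doc_b])  # rows are the term vectors of A and B

score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
print(score)  # 1.0 means identical term distributions, 0.0 means no overlap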
Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that
the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives
of a cluster from the data rather than from R^n?
The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to explain why it produced a certain output.
Let's take an example.
Suppose there is a food-review dataset with two possibilities, a positive review or a negative review, so we can say k = 2, where k is the number of clusters. If you go with k-means, the third step of the algorithm is the update step, where you update your k centroids based on the mean of the points that lie in a particular cluster. The example we have chosen is a text problem, so you would also apply some kind of text feature vectorization scheme, such as bag-of-words (BOW) or word2vec, so that for every review you get a corresponding vector. Now the centroid c_i that you get after running k-means is the mean of the vectors present in that cluster, and from that centroid you cannot interpret much - or rather, I should say, nothing.
But if you apply k-medoids to the same problem, you choose your k centroids/medoids from the dataset itself. Let's say you choose the point x_5 from your dataset as the first medoid. Your interpretability will increase, because now you have the actual review that serves as the medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation for introducing k-medoids.
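A small Python sketch of that interpretability point, assuming scikit-learn and scikit-learn-extra are available (the reviews are made up, and the exact KMedoids API should be double-checked for your versions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_extra.cluster import KMedoids

reviews = [
    "loved the pasta, great service",
    "amazing food, will come again",
    "terrible service, cold food",
    "awful experience, never again",
]

# vectorize the reviews, then cluster with k = 2
X = TfidfVectorizer().fit_transform(reviews).toarray()
km = KMedoids(n_clusters=2, random_state=0).fit(X)

# each medoid is an actual review, which is what makes the result interpretable
for idx in km.medoid_indices_:
    print(reviews[idx])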
Coming to the metrics part, you can apply all the metrics that you apply for k-means.
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
In case of relatively small data sets (as k-medoids complexity is greater) - to obtain a clustering more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=32)
# four groups of points: a and d are small scattered groups, b and c are dense clusters
a = rng.random((6, 2)) * 2.35 - 3 * np.ones((6, 2))
b = rng.random((50, 2)) * 0.25 - 2 * np.ones((50, 2))
c = rng.random((100, 2)) * 0.5 - 1.5 * np.ones((100, 2))
d = rng.random((7, 2)) * 0.55
points = np.concatenate((a, b, c, d))

plt.plot(points[:, 0], points[:, 1], "g.", markersize=8, alpha=0.3)  # green points
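To reproduce the comparison described above, the two clusterings can be fitted on the same points along these lines (a sketch assuming scikit-learn and scikit-learn-extra; the plotting of cluster assignments is omitted):

from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

kmedoids = KMedoids(n_clusters=2).fit(points)  # default options, as in the left-hand plot
kmeans = KMeans(n_clusters=2).fit(points)      # as in the right-hand plot

# the blue crosses in the plots: cluster centers of each method
print(kmedoids.cluster_centers_)  # medoids are actual points from the data
print(kmeans.cluster_centers_)    # means generally are not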
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative weren't an actual person, we wouldn't possess the history of actually watched movies to recommend, and each time we'd additionally have to search for, e.g., the closest person in the cluster. Example data: the classic MovieLens 1M Dataset.
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroids (cluster centers) are calculated as the average of the vectors contained in the cluster, while in k-medoids the medoid (cluster center) is the record from the dataset closest to the centroid. So if you need to represent a cluster center by a record from your data, use k-medoids; otherwise use k-means (but the concept of these algorithms is the same).
The K-Means algorithm uses a Distance Function such as Euclidean Distance or Manhattan Distance, which are computed over vector-based instances. The K-Medoid algorithm instead uses a more general (and less constrained) distance function: aka pair-wise distance function.
This distinction works well in contexts like Complex Data Types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithm, the distance computation phase can spend a lot of time scanning the entire vector of attributes that belongs to an instance; for instance, in the context of document clustering, using the standard TF-IDF representation. During the computation of the cosine similarity, the distance function scans all the possible words that appear in the whole collection of documents, which in many cases can comprise millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of that language.
Using K-Medoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents, it is possible to use the set of words appearing at least twice in the document, and to use the Jaccard distance as the distance measure.
In the vector case, by contrast, the representation is as long as the number of words in your dictionary.
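A small Python sketch of that set-based alternative (illustrative only; the tokenization is deliberately naive):

from collections import Counter

def word_set(text):
    # set of words appearing at least twice in the document
    counts = Counter(text.lower().split())
    return {word for word, n in counts.items() if n >= 2}

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| over the two word sets
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

doc_a = word_set("the cat sat on the mat the cat slept")
doc_b = word_set("the dog sat on the mat the dog barked")
print(jaccard_distance(doc_a, doc_b))  # ~0.67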
Heterogeneity and complex data types
There are many domains where it is considerably better to abstract the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad-hoc distance measures, which can fit the proper data domain better.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap
Is it possible to do a custom multiplication of the Score returned by Solr? We have a factor in the range of 1.00-1.30 based on our own formula and I wish to just multiply the "final" Solr score with this - without having it normalized.
I've tried using various boosts in DisMax, but none of them produce the desired result, because 1) the custom value is added (not multiplied) to the score and 2) it is normalized (queryNorm) before addition.
I found a way to do this: the Extended DisMax query parser, introduced in Solr 3.1. It offers all the same features as the normal DisMax, but with a few useful enhancements.
The one I needed was the boost parameter. It acts the same way as the bf parameter from DisMax, but instead of adding a normalized value to the score, it multiplies the boost into the score (without any normalization).
For more info, see the Solr Wiki on ExtendedDisMax
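For example, assuming the 1.00-1.30 factor is stored in a numeric field called factor (the field name here is just a placeholder), the query could pass something like:
&defType=edismax&boost=field(factor)
where boost multiplies the function's value directly into the score, without normalization.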