Utilizing metadata in Elasticsearch

Can Elasticsearch utilize metadata to improve queries? For example,
popularity of an object (number of people who requested it)
remembering previous search term (e.g. if someone searched doggg then chose the dog page, then the next time someone searches doggg, dog should be ranked higher in the query results)
If it's not possible, what other tools might be used to achieve this?

This kind of metadata can be used in a positive feedback system to improve search but Elasticsearch does not by itself store this kind of data; you will need to build a system to do this. As a couple of examples:
popularity of an object (number of people who requested it)
This could be achieved by indexing the popularity value into a field on the document and using a function score query with a field value factor function to take the popularity into account when calculating a relevancy score.
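As a hedged sketch of what such a query body could look like, here is a function_score query with a field_value_factor function, built as a plain Python dict. The index field name "popularity" and the searched field "title" are assumptions for illustration, not part of the original question:

```python
import json

def popularity_query(user_query):
    """Build an Elasticsearch function_score body that multiplies the
    text relevance score by a factor derived from a popularity field."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": user_query}},
                "field_value_factor": {
                    "field": "popularity",
                    "modifier": "log1p",   # dampen large counts: log(1 + popularity)
                    "missing": 0,          # documents without the field contribute 0
                },
                "boost_mode": "multiply",  # final score = text score * factor
            }
        }
    }

print(json.dumps(popularity_query("dog"), indent=2))
```

The `log1p` modifier keeps a document with 10,000 views from drowning out text relevance entirely; the popularity field itself would be updated by your own tracking system.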
remembering previous search term (e.g. if someone searched doggg then chose the dog page, then the next time someone searches doggg, dog should be ranked higher in the query results)
You could index search terms for a given user, along with the actual term selected and use this as an input into the search that you perform for a user. You could take advantage of a terms suggester to provide suggestions for input terms based on the available terms within the corpus of documents. Terms suggester can be useful for providing spelling corrections guided by available terms.
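A term suggester request for the "doggg" example might look like the following sketch, again as a Python dict. The suggestion name "spelling" and the field "title" are arbitrary placeholders:

```python
import json

def term_suggest_body(text, field="title"):
    """Build an Elasticsearch term-suggester body that proposes
    corrections for `text` drawn from terms indexed in `field`."""
    return {
        "suggest": {
            "spelling": {                     # arbitrary name for this suggestion
                "text": text,
                "term": {
                    "field": field,           # suggest from this field's terms
                    "suggest_mode": "missing" # only correct terms absent from the index
                },
            }
        }
    }

print(json.dumps(term_suggest_body("doggg"), indent=2))
```

Because "doggg" does not exist in the corpus while "dog" does, the suggester would propose "dog" as a correction; your recorded search-term history could then confirm or re-rank that suggestion.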

Related

Defining thresholds on Azure Search Score

All,
We have a case in our application where we collect user satisfaction feedback for matches returned from Azure Search over our data. From the limited feedback we have so far, there appears to be a correlation between scores and user satisfaction (high scores result in better satisfaction because a more useful match was found). When Azure Search scores are above 2.5, that seems to result in a Happy rating in our application. But we're not sure whether this is just a coincidence and whether this approach is even sound.
We don't know the maximum range (e.g. 0-10) of Azure Search scores. Also, the link below seems to state that the score varies as a function of the data corpus (even though, in our case, the same query is used against different input data). Is it even possible to define thresholds on Azure Search scores so that we can drop significantly low-scoring matches and not show them to the user at all in our application? Are there any recommendations around this?
https://stackoverflow.com/a/27364573
Thanks.
The reply to the question you linked is accurate. The score value is dependent on the corpus you have in your index as it uses variables such as "document frequency" which depends on the documents you have in your index. As such, the same query-document pair could have a different score when calculated in the context of two different indexes.
There also isn't any specific range to that score as it is not meant to be used as an absolute value to be compared between results of different queries. The scoring value is meant to be used to rank the relative relevancy of documents to a specific query, within the same index.
However, since the score is returned as part of the search results, nothing prevents you from applying your own client-side filtering in your application to dismiss results whose score falls below a certain threshold, if you have concluded that this makes sense in the context of your product.
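Such client-side filtering is a one-liner. In this sketch the threshold of 2.5 is the empirical value from the question, and the result shape (a list of dicts carrying an `@search.score` key) mirrors the JSON that the Azure Search REST API returns:

```python
def filter_by_score(results, threshold=2.5):
    """Keep only hits whose relevance score meets the threshold."""
    return [hit for hit in results if hit["@search.score"] >= threshold]

hits = [
    {"id": "1", "@search.score": 4.1},
    {"id": "2", "@search.score": 2.6},
    {"id": "3", "@search.score": 1.2},
]
print(filter_by_score(hits))  # the hit with score 1.2 is dropped
```

Since scores are only comparable within one index, any threshold you pick should be re-validated whenever the corpus changes significantly.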

Need clarification of boosting in Solr in terms of scoring

I am experimenting with boosting in Solr and have become confused how my document scores are being affected.
I have a collection of technical documents that contain fields like Title, Symptoms, Resolution, Classification, Tags, etc. All the fields listed are required except Tags which is optional. All fields are copied to _text_ and that field is the default search field.
When I run a default query
http://search:8983/solr/articles-experimental/select?defType=edismax&fl=id,%20tags,%20score&q=virtualization&qf=_text_
The top article (Article 42014) comes back with a score of 4.182179. This document has 6 instances of the word virtualization in multiple fields -- Title, Symptoms, Resolution, and Classification. This particular article does not have any Tags value.
I now want to experiment with boosting so that articles that have Tag values matching the search terms appear closer to the top of the results. To do this, I send the following query
http://search:8983/solr/articles-experimental/select?defType=edismax&fl=id,tags,score&q=virtualization&qf=tags^2%20_text_
which keeps the same Article 42014 at the top of the list but now with a score of 4.269944. However, results 2 through 65 now all have the same score of 4.255975. In the non-boosted query the scores range from 4.056591 down to 2.7029662.
In addition, the collection of document ids coming back is not quite the same as before. I certainly expect some differences, but not to the extent that I am seeing, considering that the vast majority of the articles coming back have the search term as a tag.
Ultimately, I am having trouble finding out exactly how boosting changes the score and what is an "appropriate" boost value. Understanding that it is probably subjective, what criteria should I be considering?
With all the parameters you set for edismax (plus the default values for all the ones you don't set), Solr nowadays runs the BM25 similarity algorithm and calculates all the scores.
The specific boosting values you should use for your query are impossible to guess; you must try and retry. It is a known pain, so much so that I built vifun, a tool to help visualize how different parameters affect the score with edismax.
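As a small aid to that try-and-retry loop, the two request URLs from the question can be generated programmatically so only the qf boost varies between runs. This sketch reuses the host, collection, and parameters exactly as they appear in the question:

```python
from urllib.parse import urlencode

BASE = "http://search:8983/solr/articles-experimental/select"

def solr_url(qf):
    """Build the select URL from the question, varying only qf."""
    params = {
        "defType": "edismax",
        "fl": "id,tags,score",
        "q": "virtualization",
        "qf": qf,
    }
    return BASE + "?" + urlencode(params)

plain   = solr_url("_text_")         # no boost
boosted = solr_url("tags^2 _text_")  # tag matches weighted twice as heavily
print(plain)
print(boosted)
```

Sweeping the boost factor (tags^1.5, tags^2, tags^5, ...) and diffing the returned id/score lists is the most direct way to see how the boost reshapes the ranking.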

Solr Phonetic Algorithm for Person Name Search

I’m new to Solr and trying to use it for person search in our project. Each person record has fields like name, date of birth, gender, and address. We have tried various fuzzy and phonetic filters to retrieve person records and are getting decent results.
For the phonetic algorithm, we are using the Beider-Morse phonetic algorithm, which is comparatively better than the other algorithms we have tried so far. I would like to know if anyone has used Solr specifically for person search; could you please share your experience with the phonetic algorithm you used for name matching, and any comparative study on those?
Many Thanks
Name matching is quite a common use case for Solr, so I am sure there are lots of people with experience in it.
But I don't think just picking the best phonetic filter will be enough. No matter what, you are going to need to customize it for your specific case. For instance:
besides names/surnames, I have always encountered other fields as well (nationality, age, gender...), and you have them too. You can typically leverage those as fq filters or just for boosting.
are false positives and false negatives equally bad, or is one less severe than the other?
does your corpus contain a single language, or can the names come from anywhere in the world?
and so on. Basis has a commercial product for this; I think you can see their presentation at Lucene/Solr Revolution 2015 on this subject.
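To make concrete what a phonetic filter does, here is a toy illustration using classic Soundex, a far simpler algorithm than Beider-Morse but built on the same idea: spelling variants of a name collapse to one code, so they match each other at query time. This is an illustrative sketch, not how Solr's filter is implemented:

```python
def soundex(name):
    """Classic Soundex: encode a name as one letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":          # h/w do not reset the previous code
            prev = code            # vowels (code "") do reset it
    return (first + "".join(digits) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))  # both encode to S530
```

Beider-Morse goes much further (it is language-aware and emits multiple codes per name), which is exactly why it tends to beat Soundex-style algorithms on multilingual person data.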

open source ranking algorithms used by Solr

I am working on Solr and want to know what ranking algorithm it uses when it returns results for a query.
Solr uses Lucene Core, a text search library written in Java, for text search. This is the same project that also powers Elasticsearch, so everything here applies to Elasticsearch too.
The core ranking algorithm (also known as the similarity algorithm) is based on term frequency/inverse document frequency, or tf/idf for short. tf/idf takes the following factors into account:
(I've copied in a description of tf/idf below from the Elasticsearch documentation - the description would be identical for Solr but this is much better written and easier to understand)
Term frequency
How often does the term appear in the field? The more often, the more
relevant. A field containing five mentions of the same term is more
likely to be relevant than a field containing just one mention.
Inverse document frequency
How often does each term appear in the index? The more often, the less
relevant. Terms that appear in many documents have a lower weight than
more uncommon terms.
Field norm
How long is the field? The longer it is, the less likely it is that
words in the field will be relevant. A term appearing in a short title
field carries more weight than the same term appearing in a long
content field.
You can find the specifics of the Lucene similarity scoring here: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
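The three factors above can be put into a small numeric sketch. This is a simplified illustration of Lucene-style TF-IDF (one term, one field), not the exact formula from the linked TFIDFSimilarity page:

```python
import math

def tfidf_score(term_freq, doc_freq, num_docs, field_length):
    """Simplified Lucene-style TF-IDF score for one term in one field."""
    tf   = math.sqrt(term_freq)                       # more mentions, more relevant
    idf  = 1.0 + math.log(num_docs / (doc_freq + 1))  # rarer terms weigh more
    norm = 1.0 / math.sqrt(field_length)              # shorter fields weigh more
    return tf * idf * norm

# The same term in a 5-word title outscores it in a 500-word body:
title_score = tfidf_score(term_freq=1, doc_freq=10, num_docs=1000, field_length=5)
body_score  = tfidf_score(term_freq=1, doc_freq=10, num_docs=1000, field_length=500)
print(title_score > body_score)
```

Playing with the arguments makes the three effects described above easy to see: raise `term_freq` and the score grows, raise `doc_freq` and it shrinks, raise `field_length` and it shrinks.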
Keep in mind that Solr/Lucene supports a rich set of functionality for altering this scoring; the discussion of Lucene scoring linked above covers this best.
If you want to read more about scoring and how to change it I'd start here:
http://wiki.apache.org/solr/SolrRelevancyFAQ
And then I would read up a bit on what a Function Query is:
FunctionQuery allows one to use the actual value of a field and
functions of those fields in a relevancy score.
Basically it provides a relatively easy-to-use mechanism for adjusting the relevancy score of a document as a function of the values within certain fields:
http://wiki.apache.org/solr/FunctionQuery
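For instance, a hypothetical edismax request could multiply the relevancy score by a function of a numeric field. The field name `popularity` here is an assumption for illustration; `log` and `sum` are standard Solr function-query functions:

```python
from urllib.parse import urlencode

# Hypothetical edismax request: the "boost" parameter multiplies the
# text relevancy score by log(popularity + 1).
params = {
    "defType": "edismax",
    "q": "virtualization",
    "qf": "_text_",
    "boost": "log(sum(popularity,1))",  # Solr function query syntax
}
print("/select?" + urlencode(params))
```

The `sum(popularity,1)` guard keeps `log` away from zero for documents whose popularity is 0.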

Solr popular search terms using Suggester

Referring to the following wiki text from http://wiki.apache.org/solr/Suggester:
"A common need in search applications is suggesting query terms or phrases based on incomplete user input. These completions may come from a dictionary that is based upon the main index or upon any other arbitrary dictionary. It's often useful to be able to provide only top-N suggestions, either ranked alphabetically or according to their usefulness for an average user (e.g. popularity, or the number of returned results)."
How does Solr know which searched terms are more popular?
Solr doesn't know this by itself; you have to do your part: record all searches made by your users and feed that data to Solr. Then, once you have some meaningful usage data, you can use it for your suggestions.
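A minimal sketch of that "do your part" step: count the terms users actually search, then emit a weighted dictionary in the term-TAB-weight format that a file-based Solr suggester dictionary (FileDictionaryFactory) consumes. The sample queries are invented for illustration:

```python
from collections import Counter

# Tally search terms as users issue them.
search_log = Counter()

def record_search(term):
    search_log[term.strip().lower()] += 1

def dump_dictionary():
    """Emit one 'term<TAB>weight' line per term, most popular first."""
    return "\n".join(f"{term}\t{count}" for term, count in search_log.most_common())

for q in ["solr", "solr", "solr", "lucene", "elasticsearch"]:
    record_search(q)
print(dump_dictionary())
```

In production the tally would live in a log pipeline or database rather than in memory, and the dictionary file would be regenerated periodically before reloading the suggester.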
