Defining thresholds on Azure Search Score - azure-cognitive-search

All,
We have a case in our application where we collect user satisfaction feedback for matches returned from Azure Search over our data. What we have noticed so far, from the limited feedback we have, is that there is a correlation between scores and user satisfaction (higher scores result in better user satisfaction because a more useful match was found). When Azure Search scores are above 2.5, that seems to result in a Happy rating in our application. But we're not sure if this is just a coincidence or whether this approach is even sound.
We don't know what the maximum range (like 0-10) is for Azure Search scores. Also, the link below seems to state that the score varies as a function of the data corpus as well (even when the same query is used against different input data, as in our case). Is it even possible to define thresholds on Azure Search scores so that we can drop significantly low-scoring matches and not show them to the user at all in our application? Are there any recommendations around this?
https://stackoverflow.com/a/27364573
Thanks.

The reply to the question you linked is accurate. The score depends on the corpus you have in your index, because it uses statistics such as "document frequency" that are derived from the documents in that index. As such, the same query-document pair could get a different score when calculated in the context of two different indexes.
There also isn't any specific range to that score as it is not meant to be used as an absolute value to be compared between results of different queries. The scoring value is meant to be used to rank the relative relevancy of documents to a specific query, within the same index.
However, since the score is returned as part of the search results, nothing prevents you from applying your own client-side filtering in your application to dismiss results with a score below a certain threshold, if you have concluded that this makes sense in the context of your product.
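For example, a minimal client-side threshold sketch using the azure-search-documents Python SDK (the endpoint, index name, and the 2.5 cutoff are placeholders, the cutoff simply echoing the value observed in the question):

```python
# Sketch only: drop low-scoring hits before showing them to the user.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

SCORE_THRESHOLD = 2.5  # tune this from your own satisfaction feedback

client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="products",  # hypothetical index name
    credential=AzureKeyCredential("<api-key>"),
)

results = client.search(search_text="user query")

# Each result is a dict that carries its relevance score under "@search.score".
visible = [doc for doc in results if doc["@search.score"] >= SCORE_THRESHOLD]
```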

Related

ANN Performance Using Separate Namespaces

I am trying to perform ANN, but my data is split into partitions or "tenants." Searches are always restricted to a single tenant, which represents a small percentage of the total documents.
I first tried implementing this using a filter on a tenant string attribute. However, I encountered this piece of documentation, which suggests the performance will be poor:
There is a small problem here, however. If the eligibility list is small in relation to the number of items in the graph, skipping occurs with a high probability. This means that the algorithm needs to consider an exponentially increasing number of candidates, slowing down the search significantly. To solve this, Vespa.ai switches over to a brute-force search when this occurs. The result is an efficient ANN search when combined with filters.
What's the best way to solve my problem? Will partitioning my data into separate namespaces trigger the creation of a separate HNSW graph per namespace?
Performance will be fine, the query planner will just choose to not use the ANN index for these queries. You'll find lots of details on this topic, including how to tune this, in this blog post: https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/
If all your queries are towards a single tenant which is a small percentage of the total documents I don't think you necessarily need to create an HNSW index at all, but this depends on the absolute numbers and the largest "small percentage".
(Namespaces are not relevant here - their only purpose is to safely add a string to ids so that you can have multiple sources of ids and still be guaranteed global uniqueness.)
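For illustration, a minimal sketch of the filter-based query against Vespa's HTTP search API (the field names, rank profile, and endpoint are hypothetical; the schema is assumed to have a "tenant" attribute, an "embedding" tensor field with an HNSW index, and a rank profile that declares the query(q) tensor input):

```python
# Sketch only: restrict the nearestNeighbor search to a single tenant.
import requests

query_vector = [0.1] * 384  # stand-in for a real query embedding

body = {
    # Vespa decides per query whether to use the HNSW index or fall back to
    # brute force when the filter leaves too few candidates.
    "yql": 'select * from sources * where tenant contains "tenant-42" '
           "and ({targetHits: 100}nearestNeighbor(embedding, q))",
    "input.query(q)": str(query_vector),  # dense tensor literal, e.g. "[0.1, 0.1, ...]"
    "ranking": "closeness_rank",          # hypothetical rank profile name
    "hits": 10,
}

response = requests.post("http://localhost:8080/search/", json=body)
print(response.json())
```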

In Azure Search, is there a way to scope facet counts to the top parameter?

If I have an index with 10,000,000 documents and search text and ask to retrieve the top 1,000 items, is there a way to scope the facets to those 1,000 items?
My current problem is:
We have a very large index with a few different facets, including manufacturer. If I search for a product (WD-40, for instance), that matches a lot of different documents and document fields. It finds the product, and it is the top-scoring match, but because they only make 1 or 2 products, the manufacturer doesn't show up as a top facet option, since facets are sorted by count.
Is there a way to scope the facets to the top X documents? Or is there a way to only grab documents that are above a certain @search.score?
The purpose of a refiner is to give users options to narrow down the result set. I would say the $top parameter and the returned facets work as they should. Trying to limit the refiners to the top 1,000 results is a bad idea when you think about it: you'll end up with confusing usability and recall issues.
Your query for WD-40 returns a large result set. So large that there are 155347 unique manufacturers listed. I'm guessing you have several million hits. The intent of that query is to return the products called WD-40 (my assumption). But, since you search all properties in all types of content, you end up with various products like doors, hinges, and bikes that might have some text saying that "put some WD-40 on it to stop squeaks".
I'm guessing that most of the hits you get are irrelevant. Thus, you should either limit the scope of your initial query by default (for example, search only the title property) or add a filter to exclude categories of documents (like manuals, price lists, etc.).
You could also consider submitting different queries from your frontend application. One narrowly scoped query that retrieves the refiners and another, broader query that returns the results.
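As a rough illustration of that two-query pattern with the azure-search-documents Python SDK (the index name and the "title"/"manufacturer" fields are hypothetical):

```python
# Sketch only: one narrow query for the refiners, one broad query for results.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="products",  # hypothetical index name
    credential=AzureKeyCredential("<api-key>"),
)

# Narrow query: match only on the title field and ask for manufacturer facets.
facet_results = client.search(
    search_text="WD-40",
    search_fields=["title"],
    facets=["manufacturer,count:20"],
    top=0,  # we only want the facet counts from this query
)
manufacturer_facets = facet_results.get_facets()["manufacturer"]

# Broad query: search all fields for the actual result list.
doc_results = client.search(search_text="WD-40", top=50)
```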
I don't have a relevant data set to test on, but I believe the $top parameter might do what you want. See this link:
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents#top-optional
That said, there are other approaches to solve your use case.
Normalize your data
I don't know how clean your data is. But, for any data set of this size, it's common that the manufacturer name is not consistent. For example, your manufacturer may be listed as
WD40 Company
WD-40 Company
WDFC
WD 40
WD-40 Inc.
...
Normalizing will greatly reduce the number of values in your refiners. It's probably not enough for your use case, but still worth doing.
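A toy sketch of what that normalization could look like at indexing time (the alias table is hypothetical and would normally be curated from your own data):

```python
# Sketch only: map known manufacturer-name variants to one canonical value.
MANUFACTURER_ALIASES = {
    "wd40 company": "WD-40 Company",
    "wd-40 company": "WD-40 Company",
    "wdfc": "WD-40 Company",
    "wd 40": "WD-40 Company",
    "wd-40 inc.": "WD-40 Company",
}

def normalize_manufacturer(raw: str) -> str:
    key = raw.strip().lower()
    return MANUFACTURER_ALIASES.get(key, raw.strip())

print(normalize_manufacturer("WDFC"))  # -> "WD-40 Company"
```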
Consider adding more refiners
When you have a refiner with too many options, it's always a good idea to consider adding more refiners with coarser values. For example, a category, or perhaps a simple refiner that splits the results in two, like "Physical vs. Digital" product as a first choice, consumer vs. professional product, or in stock vs. on back-order. This pattern allows users to quickly reduce the result set without having to use the brand refiner.
Categorize your refiner with too many options
In your case, your manufacturer refiner contains too many options. I have seen examples where people add a search box within the refiner. Another option is to group your refiner options into buckets. For text values like a manufacturer, you could generate a new property with the first character of the manufacturer's name. That way you could present a refiner that lets users select a manufacturer from A-Z.
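For example, a small sketch of deriving such an A-Z bucket when you prepare documents for indexing ("manufacturerInitial" is a hypothetical field you would add to the index and mark as facetable):

```python
# Sketch only: bucket manufacturers by their first letter.
def manufacturer_initial(name: str) -> str:
    first = name.strip()[:1].upper()
    return first if first.isalpha() else "#"  # group digits/symbols together

doc = {"manufacturer": "WD-40 Company"}
doc["manufacturerInitial"] = manufacturer_initial(doc["manufacturer"])  # -> "W"
```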

Utilizing meta-data in Elasticsearch

Can Elasticsearch utilize meta-data to improve queries? For example,
popularity of an object (number of people who requested it)
remembering previous search term (e.g. if someone searched doggg then chose the dog page, then the next time someone searches doggg, dog should be ranked higher in the query results)
If it's not possible, what other tools might be used to achieve this?
This kind of metadata can be used in a positive feedback system to improve search but Elasticsearch does not by itself store this kind of data; you will need to build a system to do this. As a couple of examples:
popularity of an object (number of people who requested it)
This could be achieved by indexing the popularity value into a field on the document and using a function score query with a field value factor function to take the popularity into account when calculating a relevancy score.
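As a rough sketch (elasticsearch-py 8.x style; the index name and the "popularity"/"title" fields are hypothetical):

```python
# Sketch only: blend text relevance with a stored popularity value.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "function_score": {
        "query": {"match": {"title": "dog"}},
        "field_value_factor": {
            "field": "popularity",  # e.g. number of times the item was requested
            "modifier": "log1p",    # dampen the effect of very popular items
            "missing": 0,
        },
        "boost_mode": "multiply",   # combine the text score with the factor
    }
}

response = es.search(index="items", query=query)
```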
remembering previous search term (e.g. if someone searched doggg then chose the dog page, then the next time someone searches doggg, dog should be ranked higher in the query results)
You could index search terms for a given user, along with the actual term selected and use this as an input into the search that you perform for a user. You could take advantage of a terms suggester to provide suggestions for input terms based on the available terms within the corpus of documents. Terms suggester can be useful for providing spelling corrections guided by available terms.
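And a minimal sketch of a terms suggester request for the "doggg" example (again elasticsearch-py 8.x style; the index and field names are hypothetical):

```python
# Sketch only: ask for spelling suggestions drawn from terms already indexed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="items",
    suggest={
        "spellcheck": {
            "text": "doggg",
            "term": {"field": "title"},
        }
    },
)

for option in response["suggest"]["spellcheck"][0]["options"]:
    print(option["text"], option["score"])  # e.g. "dog" with its similarity score
```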

Precision, Recall, ROC in solr

My final task is building a search engine. I'm using Solr to access and retrieve data from an ontology, which will later be used as corpora. I'm entirely new to all of these things (information retrieval, ontologies, Python, and Solr).
There's a step in information retrieval where you evaluate the query results. I'm planning to use Precision, Recall, and ROC scores for this. Is there any way I can use a function in Solr to calculate precision, recall, and ROC? Whether it's through the Solr interface or in the code behind it doesn't matter.
Unless I'm completely mistaken, precision and recall scores require you to know beforehand which documents are the appropriate ones to retrieve and display, so you can compare them to the documents the search engine actually returned. The search already returns what it thinks is the perfect match for your query, so it's up to you to evaluate that result against the expected result (meaning that you know which documents should have been returned).
If the search engine could decide by itself, it would always give 1 (n/n) for both precision and recall, as that would be the perfect result. If it could evaluate what those numbers would be, it wouldn't need to include them in the search result at all.
If you query for a certain term, Solr will give you all documents containing that term (and, if you want, variations of it - depending on your analysis chain). Tuning this relevancy is your task, and since it can't be done automagically - it depends on your business case - you'll have to perform the measurements yourself, with the answer key already decided.
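In other words, the evaluation happens outside Solr. A small sketch of computing precision and recall once you have such an answer key (plain Python; the ids are made up):

```python
# Sketch only: compare retrieved ids against a hand-made relevance judgment.
def precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# ids returned by a Solr query vs. the ids you judged relevant beforehand
print(precision_recall(["d1", "d2", "d3"], ["d1", "d4"]))  # (0.333..., 0.5)
```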

Google App Engine sort costs for videogame's highscore tables

I'm considering creating my own GAE app to track players' highscores in my videogames. I've already created a simple app that allows me to send and recover Top 10 highscores (that is, just 10 scores are stored per game), but now I'm considering costs if things grow.
Say a game has thousands or millions of players (hehe, not mine). I've seen how applications like OpenFeint are able to sort your score and tell your exact rank in a highscore table with thousands of entries. You may be #19623, for example.
In order to keep things simple, I would create Top 100 score tables. But what if I truly wanted to store all scores and keep things sorted? Does it make sense to simply sort scores as they are queried from the database? I don't think so...
How are such applications implemented?
On GAE it's easy to return sorted queries as long as you index your fields. If your goal is just to find the top 100 scores, you can do an ordered query by score for 100 entities - you will get them in order.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_order
The harder part is assigning the rank numbers. For the top 100, you'd basically go through the returned list of 100 entities and print a number beside each of them.
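A minimal sketch of that top-100 query, shown here with the App Engine Python ndb API (the Score model and its fields are hypothetical):

```python
# Sketch only: fetch the 100 highest scores, already sorted by the index.
from google.appengine.ext import ndb

class Score(ndb.Model):
    player = ndb.StringProperty()
    points = ndb.IntegerProperty()

top_scores = Score.query().order(-Score.points).fetch(100)
for rank, score in enumerate(top_scores, start=1):
    print(rank, score.player, score.points)  # the rank is just the list position
```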
If you need to find a user at a particular rank, you can use a cursor to narrow your search to, say, whoever is at rank #19623.
What you won't be able to do efficiently with this is figure out the rank of a single entity. In order to figure out rankings using the built-in index, you'd have to query for all entities and find where that individual entity sits in the list.
The laziest way to do the ranking would be something like: search for the top 100; if the user is in there, show their ranking; if not, tell them they are > 100. Another possibility is to occasionally do large queries to get score ranges, store those, and then give the user a less accurate answer (you are in the top 500, top 1000, etc.) without having the exact place.
Standard database indexing - both on App Engine and elsewhere - doesn't provide an efficient way to find the rank of a row/entity. One option is to go through the database at regular intervals and update the current rank. If you want ranks to be updated immediately, however, a tree-based solution is better. One is provided for App Engine in the app-engine-ranklist project.
We had the same problem with TyprX typing races (GWT + App Engine). The way we did it without going through millions of rows is to store high scores like this:
class User {
    Integer day, month, year;
    Integer highscoreOfTheDay;
    Integer highscoreOfMonth;
    Integer highscoreOfTheYear;
}
Doing so, you can get a sorted list of daily, monthly, and yearly high scores with one query. The key is to update each user's record with their own best score for each period as they finish their games.
Then we save the result to memcache, and voilà.
Daniel
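A rough sketch of that per-period approach, translated to the App Engine Python ndb API for illustration (the model mirrors the class above; the property names and the required composite index are assumptions):

```python
# Sketch only: each user stores their own best score per period,
# so a single sorted query yields the daily leaderboard.
import datetime
from google.appengine.ext import ndb

class User(ndb.Model):
    day = ndb.IntegerProperty()
    month = ndb.IntegerProperty()
    year = ndb.IntegerProperty()
    highscore_of_the_day = ndb.IntegerProperty()
    highscore_of_month = ndb.IntegerProperty()
    highscore_of_the_year = ndb.IntegerProperty()

def top_daily_scores(limit=100):
    today = datetime.date.today()
    # Needs a composite index on (day, month, year, -highscore_of_the_day).
    return (User.query(User.day == today.day,
                       User.month == today.month,
                       User.year == today.year)
                .order(-User.highscore_of_the_day)
                .fetch(limit))
```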
I'd think about using exception processing. How many of the thousands of results each day/hour will be top-100 scores? Keep a min/max top-100 range entity (memcached, of course). Each score that comes in goes one direction if it is within the range, and another direction (task queue?) if not. Why not shunt the 99% of non-relevant work to another process, and only deal with 100+1 records in whatever your final setup might be for changing the rankings?
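A loose sketch of that idea with the classic App Engine Python APIs (the memcache key, task URL, and task handler are all hypothetical):

```python
# Sketch only: handle the rare "might enter the top 100" case out of band.
from google.appengine.api import memcache, taskqueue

def submit_score(player_id, points):
    # Cached lower bound of the current top 100; default to 0 if not cached yet.
    top100_min = memcache.get("top100_min") or 0

    if points <= top100_min:
        return  # common case (~99%): no ranking work needed

    # Rare case: this score may change the top 100, so queue it for processing.
    taskqueue.add(url="/tasks/update_top100",
                  params={"player": player_id, "points": str(points)})
```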

Resources