Object with longer field is returned against an object with a short field - tf-idf

Let's say that we have an index with two objects:
{
"name": "iPhone 6s Plus big screen, super fast, ultra responsive, blah blah"
}
and:
{
"name" : "iPhone 6s Plus"
}
Now, when I search for iPhone 6s Plus, the first object is returned ahead of the second. That does not make sense, since the first object contains more words (or noise) than the second for the given query. In other words, a term appearing in a short name field should carry more ranking weight than the same term appearing in a long name field.
As far as I understand, Algolia uses a TF/IDF algorithm, which takes the field-length norm into account, so the second object should score higher than the first.
So why does the first object have a higher score than the second? Is there a setting that I am missing?

I found an answer, but I am not sure if it is 100% correct, or if there is a better way to achieve this.
Log in to Algolia -> Select Your Instance -> Go to the Ranking tab.
In the Ranking Formula, add a new row. The new row should have the Attribute type set to {{the name of the column; in this example, "name"}} and the order set to Ascending.
With that, you will achieve what we are looking for.

One option is to break up the value into two different attributes, one for just the product name and another for the description. Doing that also lets you prioritize the product name in your searchable attributes, which would lead to better relevance in most cases.
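A minimal sketch of that split, in Python. The `description` attribute name, the `build_record` helper and the sample values are assumptions made for illustration; `searchableAttributes` is the Algolia index setting in which attributes listed earlier take priority. No API client call is shown: the dictionaries would be passed to whichever client you use.

```python
def build_record(object_id, name, description=""):
    """Return a record with the product name kept apart from the marketing copy.

    The "description" attribute is a hypothetical name chosen for this sketch.
    """
    return {"objectID": object_id, "name": name, "description": description}


records = [
    build_record("1", "iPhone 6s Plus", "big screen, super fast, ultra responsive"),
    build_record("2", "iPhone 6s Plus"),
]

# Index settings: "name" is listed first so matches there outrank
# matches in the longer description text.
settings = {"searchableAttributes": ["name", "description"]}
```

With the noise moved out of `name`, both records match the query equally well on the high-priority attribute, and the long text only influences ranking through the lower-priority one.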

Related

Is there a way to execute a query on SOLR where I have a list of words that need to be in different fields?

Hi everybody. I'm trying to build a query that does the following:
Find a set of words that appear in a group of fields. For example, I want to find the documents that have the words soccer, ball and goalkeeper in one or both of the fields 'sport_name' and 'description'.
The problem I'm having is that I need to treat both fields as a single field to get results like:
{
"sport_name":"soccer",
"description": "...played with a ball... positions are goalkeeper"
}
The words may appear in either field, but all of them need to appear in the "concatenated bigger field".
Is there a way to do this during query time?
Thanks!!
You can do this by using the edismax handler (defType=edismax), setting q.op=AND (since all the terms have to be present) and using qf=sport_name description to tell Solr to search for the given terms in both fields.
You can also use qf=sport_name^2 description to say that you want to weigh hits in the sport_name field twice as much as hits in the description field. So if there was a sport named something with ball, that hit would contribute more to the score than if the same content were present in the description field.
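As a rough illustration, the parameters described above could be assembled like this in Python. The helper name `build_edismax_params` is made up for this sketch; only `defType`, `q`, `q.op` and `qf` come from the answer, and the resulting query string would be appended to your own Solr select URL.

```python
from urllib.parse import urlencode


def build_edismax_params(terms, fields, boosts=None):
    """Build edismax parameters requiring every term, searched across all fields.

    boosts maps a field name to its weight, e.g. {"sport_name": 2}.
    """
    boosts = boosts or {}
    qf = " ".join(f"{f}^{boosts[f]}" if f in boosts else f for f in fields)
    return {
        "defType": "edismax",
        "q": " ".join(terms),
        "q.op": "AND",  # all terms must be present
        "qf": qf,       # fields to search, with optional boosts
    }


params = build_edismax_params(
    ["soccer", "ball", "goalkeeper"],
    ["sport_name", "description"],
    boosts={"sport_name": 2},
)
query_string = urlencode(params)  # URL-encodes the spaces and the ^ character
```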

Index structure for azure search

I'm putting together a query to index medicines. A user should be able to enter their search term into a single search box. Their search term might be either a brand name for a drug, a generic name (the underlying compound on which all brands are based) or an indication and they should be returned a list of medicines that correspond to their search. I'd like to have a category facet for the type - either indication, brand or generic.
To have a category facet, my understanding is that I'd have to send my data through as one row per search term where that search term might be a brand, indication or a generic, rather than one row per brand with columns for generic list and indication. Is this correct or is there another way to get at what I'm wanting to do?
I hope I understand your ask here. From the screenshot you provided, I would assume what you would want to do is make the field "MedicineInformationType" a Facetable field in your Azure Search index and make the field "SearchTerm", "Product", "GenericList", and "ActionList" all Searchable fields in your Azure Search index (although I am not sure why you would want the "SearchTerm" field if the term in this field is already in one of the other fields).
If you structure your index this way, you can do a search for say "phosphate" and facet over the "MedicineInformationType" field to get a count of the results that are generic or brands.
For example (as a REST call):
search=phosphate&facet=MedicineInformationType
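For reference, a small Python sketch that builds the full request URL for that call. The service name, index name and api-version value below are placeholders assumed for illustration, and a real request also needs an api-key header, which is not shown.

```python
from urllib.parse import urlencode


def build_search_url(service, index, term, facet_field, api_version):
    """Return a GET URL that searches for term and facets over facet_field."""
    base = f"https://{service}.search.windows.net/indexes/{index}/docs"
    query = urlencode(
        {"api-version": api_version, "search": term, "facet": facet_field}
    )
    return f"{base}?{query}"


# "my-service", "medicines" and the api-version are hypothetical values.
url = build_search_url(
    "my-service", "medicines", "phosphate", "MedicineInformationType", "2020-06-30"
)
```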

IBM Watson retrieve and rank service - boolean operator

I'm writing the csv file to train a ranker in Watson Retrieve and Rank service, with many rows [query,"id_doc","relevance_score",...].
I have two questions about the structure of this file:
1. I have to distinguish two documents, depending on whether or not the query contains the word "not". More specifically:
the body and the title of the first document contain "manager"
the body and the title of the second document contain "not manager"
Thus, if the query is like "I'm a manager. How do I....?" then the first document is correct, but not the second one.
if the query is like "I'm not a manager..." then the second document is correct, but not the first one.
Is there any particular syntax that can be used to write the query in a proper way? Maybe using boolean operator? Is this file the right place to apply this kind of filter?
2. This service has also a web interface to train a ranker. The rating used in this site is: 1-> incorrect answer, 2-> relevant to the topic but doesn't answer to the question, 3-> good, but can be improved, 4->perfect answer.
Is the relevance score used in this file the same one of the web interface?
Thank you!
Is there any particular syntax that can be used to write the query in a proper way? Maybe using boolean operator? Is this file the right place to apply this kind of filter?
As you hinted, this file is not quite the appropriate place for using filters. The training data will be used to figure out what types of lexical overlap features the ranker should pay attention to when trying to optimize the ordering of the search results from Solr (see discussion here for more information: watson retrieve-and-rank - manual ranking).
That said, you can certainly add at least two rows to your training data like so:
The first can have the question text "I'm a manager. How do I do something" along with the corresponding correct doc id and a positive integer relevance label.
The second row can have the question text "I'm not a manager. How do I do something" along with the answering doc id for non-managers and a positive integer relevance label.
With a sufficient number of such examples, hopefully the ranker will learn to pay attention to bigram lexical overlap features. If this is not working, you can certainly play with pre-detecting manager vs not manager and apply appropriate filters, but I believe that's done with a separate parameter (fq?)...so you might have to modify train.py to pass the filter query appropriately (the default train.py takes the full query and passes it via the q to the /fcselect endpoint).
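A minimal sketch of writing two such rows with Python's csv module. The doc ids (`doc_manager`, `doc_not_manager`) and the label value 4 are assumptions made for illustration, and the exact quoting your training script expects may differ.

```python
import csv
import io


def write_training_rows(rows):
    """Serialize (question, doc_id, relevance) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for question, doc_id, relevance in rows:
        writer.writerow([question, doc_id, relevance])
    return buffer.getvalue()


# Hypothetical doc ids and relevance labels.
training_csv = write_training_rows(
    [
        ("I'm a manager. How do I do something", "doc_manager", 4),
        ("I'm not a manager. How do I do something", "doc_not_manager", 4),
    ]
)
```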
Is the relevance score used in this file the same one of the web interface?
Not quite, the web interface uses the 1-4 star rating to improve the UI for data collection, but then compresses the star ratings to a smaller relevance label scale when generating the training data for the ranker. I think the compression gives bad answers (i.e. star ratings < 3) a relevance label of 0 and passes the higher star ratings as is so that effectively there are 3 levels of rating (though maybe someone on the UI team can add clarification on the details if need be). It is important for the underlying ranking algorithm that bad answers receive a relevance label of 0.
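The compression described above might look like the following in Python. This only mirrors the answer's own guess (ratings below 3 become 0, higher ratings pass through unchanged); it is not confirmed behavior of the service, and the function name is made up.

```python
def stars_to_relevance(stars):
    """Map a 1-4 star rating to a relevance label, per the guess above."""
    if stars not in (1, 2, 3, 4):
        raise ValueError("star rating must be between 1 and 4")
    # Bad answers (fewer than 3 stars) get label 0; others keep their rating.
    return 0 if stars < 3 else stars


labels = [stars_to_relevance(s) for s in (1, 2, 3, 4)]
```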

Is there a way to give certain document fields a weight in app engine full text search?

In my application, I'd like the search API to value a match in the name field, higher than a match in the other fields.
A user can also fill in an 'about' message, which has way more text, so it could be more likely that a match happens there. Is there any way to do this?
SortExpression (https://developers.google.com/appengine/docs/python/search/sortexpressionclass) provides a way to set the sort based on a particular expression, but it only offers a document-wise score (i.e. not per field).
Another (probably bad) idea is to search only by the name field, using a query string like "name: my_search_term_here".
So, to my knowledge, the Search API of Google App Engine offers no way to bias one field during search (i.e. nothing similar to the ^ operator in Apache Solr/Lucene).
I am not aware of this functionality in Google AppEngine. Having said that, you could split this problem in two steps. First search for a term in name field, which would give you a list of documents, call it list1. Then search for the same term in less important fields. This would give you another list, call it list2. You can then combine these two lists in any way you want - i.e. make a new list3 which is a concatenation of list1 and list2 and all items from list1 are before items from list2. Hope this helps.
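The combining step can be sketched in plain Python. The two input lists stand for the document ids returned by the two searches; the ids used here are made up, and dropping duplicates is an addition of this sketch, since a document matching in both the name and another field would otherwise appear twice.

```python
def merge_by_priority(name_hits, other_hits):
    """Concatenate two result lists, name matches first, without duplicates."""
    seen = set()
    merged = []
    for doc_id in list(name_hits) + list(other_hits):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


list1 = ["d2", "d5"]              # hits from the name field (hypothetical ids)
list2 = ["d5", "d1", "d2", "d9"]  # hits from the less important fields
list3 = merge_by_priority(list1, list2)
```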

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id | partno   | name         | description
1  | 1000.001 | Apple iPod   | iPod by Apple
2  | 1000.123 | Apple iPhone | The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain doesn't do this. If you absolutely need it, I'd try the multiple-requests solution and benchmark it. Solr tends to be a lot faster than what people put in front of it, so a couple of requests might not be that big of a deal.
You could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that do not break the "single query" requirement:
1) copyField: at index time, you group all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're back at the same spot.
2) You could filter the query dynamically each time by adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
This works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) The dismax plugin, which its documentation describes like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look and start from the qf parameter (that is what option number 2 was meant to be about, but I changed it in favor of fq... don't ask me why...).
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
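To turn this into the "Filter by field" list, one possible sketch in Python builds one facet.query per field and reads the counts back from the facet_queries part of the response. The helper names are made up for this sketch, the search term is not escaped for Solr query syntax, and the sample response is trimmed from the one shown earlier.

```python
from urllib.parse import urlencode


def build_field_count_query(term, fields):
    """Return a query string with one facet.query per candidate field."""
    params = [
        ("q", term),
        ("defType", "edismax"),
        ("qf", " ".join(fields)),
        ("rows", "10"),
        ("facet", "true"),
        ("wt", "json"),
    ]
    params += [("facet.query", f"{field}:{term}") for field in fields]
    return urlencode(params)


def field_counts(response, term):
    """Map each field name to its hit count from a parsed Solr JSON response."""
    suffix = f":{term}"
    facet_queries = response["facet_counts"]["facet_queries"]
    return {
        key[: -len(suffix)]: count
        for key, count in facet_queries.items()
        if key.endswith(suffix)
    }


query_string = build_field_count_query("donkey", ["text", "title"])
sample = {"facet_counts": {"facet_queries": {"title:donkey": 127, "text:donkey": 4108}}}
counts = field_counts(sample, "donkey")
```

Fields whose count is zero can simply be left out of the filter list.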

Resources