How can I rank results lower in SOLR if two fields match at the same time? - solr

I have records with "title" and "brand" fields, and I query both fields.
Sometimes a record has the brand in the title, which results in a higher score, but I want such records to score the same as the others.
How can I rank records lower where both fields match?

Your solution is not ideal.
In Solr, there is the Dismax query parser that allows you to search for individual terms across several fields, using some other parameters to influence the final score.
The q parameter defines the main query while the qf parameter can be used to specify a list of fields with which to search.
In addition, the tie parameter lets you control how much the final score of the query will be influenced by the scores of the lower-scoring fields compared to the highest-scoring field.
Let's make a simple example.
Using the standard query parser, this is what you get when you search both fields for adidas:
http://localhost:8983/solr/indexName/select?q=title:adidas%20OR%20brand:adidas&fl=id,title,brand,score
"docs": [
{
"id": "2",
"title": "Shoes Adidas",
"brand": "Adidas",
"score": 0.9623127
},
{
"id": "1",
"title": "Shoes",
"brand": "Adidas",
"score": 0.31506687
},
{
"id": "6",
"title": "Shirt",
"brand": "Adidas",
"score": 0.31506687
}
]
The doc with id 2 has a higher score than the others because the score is the sum of two clauses ('adidas' in title + 'adidas' in brand).
If you perform a Dismax query with tie=0 (a pure "disjunction max query"):
http://localhost:8983/solr/indexName/select?defType=dismax&q=adidas&qf=brand%20title&fl=id,title,brand,score&tie=0
You will obtain:
"docs": [
{
"id": "2",
"title": "Shoes Adidas",
"brand": "Adidas",
"score": 0.6472458
},
{
"id": "1",
"title": "Shoes",
"brand": "Adidas",
"score": 0.31506687
},
{
"id": "6",
"title": "Shirt",
"brand": "Adidas",
"score": 0.31506687
}
]
The doc with id 2 has a lower score than before because only the maximum scoring subquery contributes to the final score, i.e. it takes the max score between 0.6472458 and 0.31506687 without summing them (0.9623127).
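The arithmetic behind the two responses can be sketched in a few lines. This is a minimal illustration of how a disjunction-max score combines per-field scores, not Solr's own code; the function name dismax_score is made up for the example, and the two input scores are the ones quoted above for doc 2.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the remaining field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Per-field scores for doc 2, taken from the responses above.
doc2 = [0.6472458, 0.31506687]

# tie=0: only the best-scoring field counts.
print(dismax_score(doc2, tie=0.0))

# tie=1: every field counts in full, which matches the summed OR query.
print(dismax_score(doc2, tie=1.0))
```

Values of tie between 0 and 1 land somewhere between the two results, so a small tie still gives a slight edge to records that match in both fields.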
With the qf parameter, it is also possible to assign a boost factor to increase or decrease the importance of a particular field in the query, for example:
&qf=brand^3 title
It makes matches in brand much more significant than matches in title.
In any case, boosting should be used with caution because it may lead to unexpected results. Every decision with boosting should be supported by an online and offline search relevance evaluation.
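If you build the request from code, something like the following sketch takes care of encoding the boost and the space in qf. The host, port and index name are simply copied from the example URLs above, and the boost of 3 is the one from the qf example; adjust both to your setup.

```python
from urllib.parse import urlencode

# Host, port and index name are copied from the example URLs above.
base_url = "http://localhost:8983/solr/indexName/select"

params = {
    "defType": "dismax",
    "q": "adidas",
    "qf": "brand^3 title",  # boost brand matches by a factor of 3
    "tie": "0",
    "fl": "id,title,brand,score",
}

# urlencode escapes the caret, the space and the commas.
url = base_url + "?" + urlencode(params)
print(url)
```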
Can this help you?

I solved it by removing all occurrences of the brand in the title (and other fields) when writing the index.

Related

mongodb design issue. Should I use sharding cluster or divide documents to different collection when I have multiple same value of key

I have a collection that looks like this:
[{
"company": "A",
"name": "N1",
"age": "C1"
},{
"company": "A",
"name": "N2",
"age": "C2"
},{
"company": "B",
"name": "N3",
"age": "C3"
}]
I have 2 optimization strategies:
1. sharding
sharding key: company.
Then company A and company B would be split across two shards.
And I will store them on different mongod servers.
2. divide them by collection name
company A's collection is
col_A = [{
"name": "N1",
"age": "C1"
},{
"name": "N2",
"age": "C2"
}]
company B's collection is
col_B = [{
"name": "N3",
"age": "C3"
}]
Which plan is better?
I think the second solution reduces the number of primary keys and the amount of data. But it will cause the number of collections to increase.
Which should I choose? Thanks a lot...
It depends on your data size and on your scaling strategy.
Sharding is the better approach when you are thinking about future data size, as it handles horizontal scaling better. With your second option, it won't be easy to scale horizontally.
If you take the first option, you probably want to consider other potential shard keys that could give better performance for your queries. For example, if one company makes most of the queries in the entire application, sharding based on the company might not be the best choice.
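To see why a dominant company is a problem for a company shard key, here is a toy simulation. It is only a sketch: the crc32-based routing is a stand-in and not MongoDB's actual algorithm, and the 90/10 query split is a made-up workload.

```python
import zlib
from collections import Counter


def shard_for(key, num_shards):
    """Toy stand-in for shard routing: hash the key value, take it modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Hypothetical workload: company A issues 90% of the queries.
queries = ["A"] * 90 + ["B"] * 10

# Every query for a given company is routed to the same shard.
load = Counter(shard_for(company, num_shards=2) for company in queries)
print(load)
```

However the two companies are placed, one shard ends up serving at least 90 of the 100 queries, so adding shards does not spread that load.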
One more advantage of sharding is that you don't need to refactor your model/repository as much as you would with multiple collections.

Boost document in solr where document's field contains some value

I have documents in Solr in the format below
{ "documents": [ {
"custom_string_New Arrival": "false",
"custom_string_Brand Name": "GB",
"custom_string_Product Name": "GB GB Girls Big Girls 7%2D16 Flutter%2DSleeve Jumpsuit",
"score": 11.223517,
"id": "67012"
},
{
"custom_string_New Arrival": "false",
"custom_string_Brand Name": "Lucy Paris",
"custom_string_Product Name": "Lucy Paris Knit Camille Sleeveless Belted Jumpsuit",
"score": 11.223517,
"id": "50097"
} ] }
I want to boost a document whose custom_string_Product Name contains "Paris Knit".
I am creating a Solr query with the query parameter
bq=(custom_string_Product\ Name:(*Paris Knit*))^5000
I am expecting the document with id=50097 to come out on top, but I am not getting the expected result.
But if I do
bq=(custom_string_Product\ Name:(*Knit*))^5000
then I get the correct response.
The only difference is that in the first query there is a space in the search term.
When you use wildcard queries (i.e. a * is present), most analysis is skipped (except for the few filters that are multi-term aware). In this case it simply doesn't work because there are no tokens matching Paris Knit: the tokens are probably stored as paris and knit, not as one single token.
You can use either a string field type or a KeywordTokenizer for the field type. The KeywordTokenizer also lets you add a LowercaseFilter, so that your boosts become case insensitive.
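The effect can be illustrated with a toy model in Python. It does not reproduce Solr's analysis chain: it assumes the tokenized field splits on whitespace and lowercases, that the keyword-style field stores the whole lowercased value as one token, and it uses fnmatch as a stand-in for wildcard matching.

```python
from fnmatch import fnmatchcase

product_name = "Lucy Paris Knit Camille Sleeveless Belted Jumpsuit"

# Assumed tokenized field: split on whitespace, lowercase each token.
tokens = product_name.lower().split()

# Assumed keyword-style field: the whole value is one lowercased token.
keyword_token = [product_name.lower()]


def wildcard_matches(pattern, indexed_tokens):
    """A wildcard term has to match one single indexed token."""
    return any(fnmatchcase(token, pattern) for token in indexed_tokens)


print(wildcard_matches("*knit*", tokens))               # True
print(wildcard_matches("*paris knit*", tokens))         # False
print(wildcard_matches("*paris knit*", keyword_token))  # True
```

No single token contains a space in the tokenized field, so the two-word pattern can only match once the whole value is indexed as one token.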

Azure Search different scores for exact match

I have a table, User, with fields such as firstName and lastName.
If I search for Amit and use only firstName as the searchField, I get different scores.
$count=true&search=Amit^2&searchFields=firstName&$select=firstName&queryType=full
"value": [
{
"@search.score": 7.986226,
"firstName": "Amit"
},
{
"@search.score": 7.986226,
"firstName": "Amit"
},
...
...
...
{
"@search.score": 7.986226,
"firstName": "Amit"
},
{
"@search.score": 7.9655724,
"firstName": "Amit"
},
Above is a small result set, but I can see the score changing after 15-20 results.
I was expecting the same score when firstName is the same, so that a more complex query could sort on score and then lastName.
The search score for a document is a combination of how well a document matches a query and how relevant it is compared to "nearby" documents. Depending on the exact partitioning of documents into shards, exact matches may get different scores, but they will always score higher than non-exact matches.
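A rough sketch of why the shard matters: scoring formulas typically include a term-rarity statistic computed from the documents a shard holds. The BM25-style idf below is used only as an example of such a statistic, and whether Azure Search applies exactly this formula is an assumption; the document counts are made up.

```python
import math


def idf(total_docs, docs_with_term):
    """BM25-style inverse document frequency, computed from one shard's own counts."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical shards of equal size where "amit" occurs slightly more often in one.
shard_a = idf(total_docs=1000, docs_with_term=20)
shard_b = idf(total_docs=1000, docs_with_term=22)

# Identical documents pick up a slightly different weight depending on the shard.
print(shard_a, shard_b)
```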

MongoDB How to make 2 nested documents have the same _id?

I currently have 2 documents in my mongodb collection. They are:
"_id": "5b6c7109c21dfe2a4b557b1e",
"title": "Childish Gambino",
"datetime": "2018-09-16T00:00:00.000Z",
"venue": {
"name": "Madison Square Garden",
"city": "New York",
"state": "NY",
"_id": "5b6c7109c21dfe2a4b557b1d"
}
and
"_id": "5b6c71133acdbe2a4e51615d",
"title": "The Eagles",
"datetime": "2018-09-12T00:00:00.000Z",
"venue": {
"name": "Madison Square Garden",
"city": "New York",
"state": "NY",
"_id": "5b6c71133acdbe2a4e51615c"
}
Although these are two different concerts, both take place at Madison Square Garden. However, in the 2 documents, Madison Square Garden has 2 different "_id" attributes. Is there a way I can make them have the same _id or would I have to manually input the same _id every time I add a document that takes place at Madison Square Garden? Is manually inputting _id generally considered an okay practice? New to mongodb so any help would be appreciated. Thanks!
If you use MongoDB version 3.4 and up you could use $lookup which means that in your schema you could add a reference to the actual Madison Square Garden (which I assume resides in another collection). This way your data could look something like this:
"_id": ObjectId('5b6c71133acdbe2a4e51615d'),
"title": "The Eagles",
"datetime": "2018-09-12T00:00:00.000Z",
"venueId": ObjectId('5b6c71133acdbe2a4e51615c')
And you would be able to cross-reference that venue using the $lookup feature.
Note: This is by no means more efficient than having the data embedded but it gives you one reference _id.
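As a sketch of what that cross-reference looks like, the stage below uses the $lookup fields from, localField, foreignField and as. The collection name venues and the output field venue are hypothetical, and the ids are plain strings here instead of ObjectId so the example runs without a database; the small emulate_lookup helper only mimics the join in memory.

```python
# Hypothetical names: the "venues" collection and the "venue" output field.
lookup_stage = {
    "$lookup": {
        "from": "venues",
        "localField": "venueId",
        "foreignField": "_id",
        "as": "venue",
    }
}

venues = [
    {"_id": "5b6c71133acdbe2a4e51615c", "name": "Madison Square Garden",
     "city": "New York", "state": "NY"},
]
concerts = [
    {"_id": "5b6c7109c21dfe2a4b557b1e", "title": "Childish Gambino",
     "venueId": "5b6c71133acdbe2a4e51615c"},
    {"_id": "5b6c71133acdbe2a4e51615d", "title": "The Eagles",
     "venueId": "5b6c71133acdbe2a4e51615c"},
]


def emulate_lookup(docs, foreign_docs, stage):
    """In-memory imitation of $lookup: attach the matching foreign docs as a list."""
    spec = stage["$lookup"]
    joined = []
    for doc in docs:
        matches = [f for f in foreign_docs
                   if f[spec["foreignField"]] == doc.get(spec["localField"])]
        joined.append({**doc, spec["as"]: matches})
    return joined


result = emulate_lookup(concerts, venues, lookup_stage)
print(result[0]["venue"][0]["name"])
```

Against a real database, a stage like this would be passed in the pipeline list of an aggregate call on the concerts collection; both concerts then point at the single venue document and its one _id.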

Why is it possible to get duplicate results from Azure Search when paging?

Sometimes when using Azure Search's paging there may be duplicate documents in the results. Here is an example of a paging request:
GET /indexes/myindex/docs?search=*&$top=15&$skip=15&$orderby=rating desc
Why is this possible? How can it happen? Are there any consistency guarantees when paging?
The results of paginated queries are not guaranteed to be stable if the underlying index is changing, or if you are relying on sorting by relevance score. Paging simply changes the value of $skip for each page, but each query is independent and operates on the current view of the data (i.e. there is no snapshotting or other consistency mechanism like you’d find in a general-purpose database).
Here is an example of how you might get duplicates. Assume an index with four documents:
{ "id": "1", "rating": 5 }
{ "id": "2", "rating": 3 }
{ "id": "3", "rating": 2 }
{ "id": "4", "rating": 1 }
Now assume you want to page through the results with a page size of two, ordered by rating. You’d execute this query to get the first page:
$top=2&$skip=0&$orderby=rating desc
And get these results:
{ "id": "1", "rating": 5 }
{ "id": "2", "rating": 3 }
Now you insert a fifth document into the index:
{ "id": "5", "rating": 4 }
Shortly thereafter, you execute a query to fetch the second page of results:
$top=2&$skip=2&$orderby=rating desc
And get these results:
{ "id": "2", "rating": 3 }
{ "id": "3", "rating": 2 }
Notice that you’ve fetched document 2 twice. This is because the new document 5 has a greater value for rating, so it sorts before document 2 and lands on the first page.
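The same sequence can be replayed with a small in-memory simulation. This is only a sketch of the paging behaviour: fetch_page is a made-up helper that emulates $orderby=rating desc with $top and $skip, and it does not call Azure Search.

```python
def fetch_page(index, top, skip):
    """Sort the current index state by rating (descending), then apply skip and top."""
    ordered = sorted(index, key=lambda doc: doc["rating"], reverse=True)
    return ordered[skip:skip + top]


index = [
    {"id": "1", "rating": 5},
    {"id": "2", "rating": 3},
    {"id": "3", "rating": 2},
    {"id": "4", "rating": 1},
]

page1 = fetch_page(index, top=2, skip=0)

# A new document arrives between the two page requests.
index.append({"id": "5", "rating": 4})

page2 = fetch_page(index, top=2, skip=2)

# Ids that appear on both pages.
seen_twice = {doc["id"] for doc in page1} & {doc["id"] for doc in page2}
print(seen_twice)  # {'2'}
```

Each call sorts whatever the index contains at that moment, which is why the insert shifts document 2 onto the second page.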
In situations where you're relying on document score (either you don't use $orderby or you're using $orderby=search.score()), paging can return duplicate results because each query might be handled by a different replica, and that replica may have different term and document frequency statistics, enough to change the relative ordering of documents at page boundaries.
For these reasons, it’s important to think of Azure Search as a search engine (because it is), and not a general-purpose database.
