I trained a rankprob model using TensorFlow. The input to the model is a query and two documents, and the model's output is the probability that doc1 should be ranked higher than doc2. Is it possible to use this model for second-level (second-phase) reranking in Vespa? If yes, can someone point me to the relevant documentation?
When Vespa evaluates the ranking expressions configured in your rank profile, it does so one document at a time and produces a final relevancy score which can be used to rank (order) the recalled documents.
For TensorFlow model integration see:
https://docs.vespa.ai/documentation/tutorials/blog-recommendation-nn.html
https://docs.vespa.ai/documentation/tensorflow.html
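For reference, a second-phase reranking profile in a Vespa search definition looks roughly like the sketch below (the profile name and model path are made up for illustration; also note that since ranking expressions score one document at a time, a pairwise doc1-vs-doc2 model would first have to be reduced to a per-document scoring function):

```
rank-profile rankprob inherits default {
    first-phase {
        expression: nativeRank
    }
    second-phase {
        # evaluate the imported TensorFlow model on the top hits only
        expression: sum(tensorflow("rankprob/saved"))
        rerank-count: 100
    }
}
```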
I want to know whether the Retrieve & Rank service, and especially its ranking phase, supports searching by proximity.
Example :
Ranker learned :
a. Query = "I have a problem with my mailbox"
b. Documents with pertinence scores: "Doc1": 3, "Doc2": 4, "Doc3": 1
So we can imagine that when I use Retrieve service only, the result of the query is :
1. Doc1
2. Doc2
3. Doc3
And when I use the Ranker to re-order the previous result, we have :
1. Doc2
2. Doc1
3. Doc3
At this moment, everything is OK.
Now I want to execute a new (and similar) query by using the Ranker : "I encountered a problem with my mailbox"
The question is :
Will the Ranker match my new query with the query that it learned previously, so that the result will be:
1. Doc2
2. Doc1
3. Doc3
Or will the Ranker not match my new query with the previously learned query, so that the result will be the same as from the Retrieve service alone:
1. Doc1
2. Doc2
3. Doc3
This documentation https://www.ibm.com/watson/developercloud/doc/retrieve-rank/plugin_query_syntax.shtml , and especially this text, makes me think that the Ranker will not match the queries :
The following modifiers are not supported with the /fcselect request handler:
- [...]
- Search by proximity
- [...]
But when I try this example, it seems that the Ranker does match the queries...
Thanks for your time.
So the ranker does not work by memorizing your training questions, or by mapping new questions to the closest question in the training data set. In fact, the ranker doesn't directly work with questions at all.
Instead, as per the overview material in the RnR documentation, the ranker uses an approach called 'learning-to-rank' (it might be helpful to take a look through the wikipedia entry for it: https://en.wikipedia.org/wiki/Learning_to_rank).
Essentially, the learning-to-rank approach is to first generate a bunch of features that capture some notion of how well each of the candidate documents returned from the initial Retrieve phase matches the query. See this post for more info on features: watson retrieve-and-rank - manual ranking.
Then, based on the training data, the ranker will learn how to pay attention to these features in order to best re-rank the set of candidate documents in order to optimize for relevance. This approach allows it to generalize to different questions that come in the future (these might have the same topics, or they might not).
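As a toy illustration of that idea (not the actual RnR internals), a trained ranker boils down to learned weights over query-document match features, and reranking is just scoring each candidate with those weights and sorting. The feature names, weights, and values below are all invented:

```java
import java.util.*;

public class LtrSketch {
    // Hypothetical learned weights for two match features.
    static final double W_TITLE_OVERLAP = 2.0;
    static final double W_BODY_OVERLAP = 1.0;

    // Score one candidate document from its feature values.
    static double score(double titleOverlap, double bodyOverlap) {
        return W_TITLE_OVERLAP * titleOverlap + W_BODY_OVERLAP * bodyOverlap;
    }

    // Re-rank candidate doc ids by descending learned score.
    static List<String> rerank(Map<String, double[]> features) {
        List<String> docs = new ArrayList<>(features.keySet());
        docs.sort((a, b) -> Double.compare(
                score(features.get(b)[0], features.get(b)[1]),
                score(features.get(a)[0], features.get(a)[1])));
        return docs;
    }

    public static void main(String[] args) {
        // Feature values are computed per (query, document) pair at search time,
        // so a new-but-similar question still produces similar features.
        Map<String, double[]> features = new LinkedHashMap<>();
        features.put("Doc1", new double[] {0.2, 0.9}); // score 1.3
        features.put("Doc2", new double[] {0.8, 0.1}); // score 1.7
        features.put("Doc3", new double[] {0.1, 0.2}); // score 0.4
        System.out.println(rerank(features)); // prints [Doc2, Doc1, Doc3]
    }
}
```

Because the scoring operates on features rather than on the literal question text, an unseen query like "I encountered a problem with my mailbox" gets sensible treatment as long as it produces similar feature values.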
All:
I wonder if there is any way we can use Lucene to do search-keyword relevancy discovery based on search history.
For example:
The code would read in the user's search string, parse it, extract the keywords, and find out which words are most likely to occur together in searches.
When I tried Solr, I found that Lucene has a lot of text-analysis features, which is why I am wondering whether there is any way to use it, combined with other machine-learning libraries (if necessary), to achieve my goal.
Thanks
Yes and No.
Yes.
It should work. Simply treat every keyword as a document and then use the MoreLikeThis feature of Lucene, which constructs a Lucene query on the fly based on terms within the raw query. That query is then used to find other similar documents (keywords) in the index.
MoreLikeThis mlt = new MoreLikeThis(reader);      // Pass the index reader
mlt.setFieldNames(new String[] {"keywords"});     // Specify the field for similarity
mlt.setAnalyzer(analyzer);                        // Use the same analyzer as at index time
Query query = mlt.like(docID);                    // Pass the doc id
TopDocs similarDocs = searcher.search(query, 20); // Use the searcher
if (similarDocs.totalHits == 0) {
    // Handle the empty-result case
}
Suppose in your indexed keywords, you have such keywords as
iphone 6
apple iphone
iphone on sale
apple and fruit
apple and pear
When you launch a query with "iphone", I am sure you will find the first three keywords above as the "most similar", due to the full term match on "iphone".
No.
The default similarity function in Lucene will never understand that "iphone" is related to Apple Inc., and thus that "iphone" is relevant to "apple store". If your raw query is just "apple store", an ideal search result within your current keywords would be as follows (ordered by relevancy from high to low):
apple iphone
iphone 6
iphone on sale
Unfortunately, you will get the results below:
apple iphone
apple and fruit
apple and pear
The first one is great, but the other two are totally unrelated. To get real relevancy discovery (using semantics), you need to do more work on topic modeling. If you happen to have a good way (e.g., a pre-trained LDA model or word2vec) to pre-process each keyword and produce a list of topic ids, you can store those topic ids in a separate field of each keyword document. Something like below:
[apple iphone] -> topic_iphone:1.0, topic_apple_inc:0.8
[apple and fruit] -> topic_apple_fruit:1.0
[apple and pear] -> topic_apple_fruit:0.99, topic_pear_fruit:0.98
where each keyword is also mapped to a few topic ids with weight value.
At query time, you should run the same topic modeling tool to generate topic ids for the raw query together with its terms. For example,
[apple store] -> topic_apple_inc:0.75, topic_shopping_store:0.6
Now you should combine the two fields (keyword and topic) to compute the overall similarity.
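Here is a minimal sketch of that combination (the 50/50 blend, topic ids, and weights are invented for illustration; in a real setup the topic field would live in the Lucene index and contribute to the score through field boosting rather than hand-rolled code):

```java
import java.util.*;

public class TopicCombine {
    // Fraction of query terms that also appear in the keyword document.
    static double termOverlap(Set<String> queryTerms, Set<String> docTerms) {
        long hits = queryTerms.stream().filter(docTerms::contains).count();
        return queryTerms.isEmpty() ? 0.0 : (double) hits / queryTerms.size();
    }

    // Dot product of two sparse topic-weight vectors.
    static double topicSim(Map<String, Double> q, Map<String, Double> d) {
        double s = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            s += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        return s;
    }

    // Combined score: hypothetical equal blend of the keyword and topic fields.
    static double score(Set<String> qTerms, Map<String, Double> qTopics,
                        Set<String> dTerms, Map<String, Double> dTopics) {
        return 0.5 * termOverlap(qTerms, dTerms) + 0.5 * topicSim(qTopics, dTopics);
    }

    public static void main(String[] args) {
        Set<String> q = Set.of("apple", "store");
        Map<String, Double> qTopics =
                Map.of("topic_apple_inc", 0.75, "topic_shopping_store", 0.6);

        double iphone = score(q, qTopics, Set.of("apple", "iphone"),
                Map.of("topic_iphone", 1.0, "topic_apple_inc", 0.8));
        double fruit = score(q, qTopics, Set.of("apple", "and", "fruit"),
                Map.of("topic_apple_fruit", 1.0));

        // The shared topic_apple_inc lifts "apple iphone" above "apple and fruit".
        System.out.println(iphone > fruit); // prints true
    }
}
```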
We have a Solr index with multiple collections, i.e. collection_data_sales and collection_data_marketing. When the user performs a search query, both collections are queried via a collection alias. Both collections have the same Solr schema.
Is there a way to boost the result from a specific collection ?
i.e. Suppose the user specifies the sales collection; then the search should happen on both collection_data_sales and collection_data_marketing, but documents from collection_data_sales should be boosted.
It is enough if you are able to differentiate the two collections using data in the documents themselves. Let's imagine that the schema has a field type, so that collection_data_marketing documents have type:marketing and collection_data_sales documents have type:sales.
The only thing you have to do now is use a boost function, for example like this:
bf=sum(product(query($q1),10),product(query($q2),3))&q1=type:sales&q2=type:marketing
In this example sales will have weight 10 and marketing will have weight 3.
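Put together as a full request against the alias (the alias name, query terms, and field names here are only illustrative), and assuming the edismax query parser so that bf is applied:

```
http://localhost:8983/solr/collection_data/select
    ?q=some search terms
    &defType=edismax
    &bf=sum(product(query($q1),10),product(query($q2),3))
    &q1=type:sales
    &q2=type:marketing
```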
I am developing a Spring-based website and I need to use a search engine to provide "customized" search results. I am considering Solr or Elastic.
Here is what I mean by "customized".
Suppose I have two fields A and B to search against.
Suppose that there are two visitors and I am able to profile them by tracking their activities. Suppose visitor 1 constantly uses or searches for value a (of A) and visitor 2 value b (of B). Now both visitors search for records that satisfy A=a OR B=b.
Can Solr or Elastic return results in a different order for visitors 1 and 2? I mean, for example, can results with A=a be ranked ahead of B=b-only results for visitor 1, and the opposite for visitor 2?
I understand that I need to pass some signal to a search engine to ask the engine to give more "weight" to one of the fields.
Thanks and Regards.
It looks like you just need to give a different weight to the fields you're querying on depending on the user that's executing the query.
You could for example use a multi_match query with elasticsearch, which allows you to search on multiple fields, giving them different weights as well. Here is an example that makes fieldA more important:
{
    "multi_match" : {
        "query" : "this is a test",
        "fields" : [ "fieldA^3", "fieldB" ]
    }
}
That way the score is influenced by the weights that you put on the query, and if you sort by score (default) you get the results in the expected order. The weights assigned to the fields need some fine-tuning though depending on your documents and the query you execute.
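So for visitor 2 you would issue the same query with the boost moved to the other field (field names again being placeholders):

```
{
    "multi_match" : {
        "query" : "this is a test",
        "fields" : [ "fieldA", "fieldB^3" ]
    }
}
```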
I am searching "product documents". In other words, my Solr documents are product records. I want to get, say, the top 50 matching products for a query. Then I want to be able to sort those top 50 scoring documents by name or price. I'm not seeing much on how to do this: sorting by score, then by name or price, won't really help, because scores are floats and ties are rare.
I wouldn't mind doing something like mapping the scores to ranges (e.g. a score of 8.0-8.99 goes into the 8 bucket), then sorting by range, then by name; but since there is basically no normalization of scoring, this would still make things a bit harder.
Tl;dr How do I exclude low scoring documents from the solr result set before sorting?
You can use frange to achieve this, as long as you don't want to sort on score (in which case I guess you could just do the filtering on the client side).
Your query would be something along the lines of:
q={!frange l=5}query($qq)&qq=[awesome product]&sort=price asc
Set the l argument in the q-frange-parameter to the lower bound you want to filter score on, and replace the qq parameter with your user query.
As observed by Karl Johansson, you could do the filtering on the client side: load the first 50 rows of the response (sorted by score desc) and then manipulate them in JS for example.
The jQuery DataTables plugin works fantastically for that kind of thing: sorting, sorting on multiple columns, dynamic filtering, etc. -- and with only 50 rows it would be very fast too, so that users can "play" with the sorting and filtering until they find what they want.
I don't think you can simply "exclude low scoring documents from the solr result set before sorting",
because the relevance score is only meaningful for a given combination of search query and resulting document list. I.e. scores are only meaningful within a given search and you cannot set some threshold for all searches.
If you were using Java (or PHP) you could get the top 50 documents and then re-sort this list in your programming language but I don't think you can do it with just SOLR.
Anyway, I would recommend you don't go down this route of re-sorting the results from SOLR, as it will simply confuse the user. People expect search results to be like Google's (and most other search engines'), where results come back in some form of TF-IDF ranking.
Having said that, you could use some other criteria to separate documents with the same relevance scores by adding an index-time boost factor based on a price range scale.
I'd suggest you use SOLR to its strengths and use facets. Provide a price range facet on the left (like Ebay, Amazon, et al.) and/or a product category facet, etc. Also provide a "sort" widget to allow the results to be sorted by product name, if the user wants it.
[EDIT] this question might also be useful:
Digg-like search result ranking with Lucene / Solr?