Search by proximity and Ranker - ibm-watson

I want to know whether the Retrieve & Rank service, especially during the ranking phase, allows searching by proximity.
Example:
The ranker learned:
a. Query = "I have a problem with my mailbox"
b. Documents with relevance scores: "Doc1": 3, "Doc2": 4, "Doc3": 1
So we can imagine that when I use the Retrieve service only, the result of the query is:
1. Doc1
2. Doc2
3. Doc3
And when I use the Ranker to re-order the previous result, we have:
1. Doc2
2. Doc1
3. Doc3
At this moment, everything is OK.
Now I want to execute a new (and similar) query using the Ranker: "I encountered a problem with my mailbox"
The question is:
Will the Ranker match my new query with the query it learned previously, so that the result will be:
1. Doc2
2. Doc1
3. Doc3
Or will the Ranker fail to match my new query with the previously learned query, so that the result will be the one from the Retrieve service alone:
1. Doc1
2. Doc2
3. Doc3
This documentation https://www.ibm.com/watson/developercloud/doc/retrieve-rank/plugin_query_syntax.shtml, and especially this text, makes me think that the Ranker will not match the queries:
The following modifiers are not supported with the /fcselect request handler:
- [...]
- Search by proximity
- [...]
But when I try this example, it seems that the Ranker does match the queries...
Thanks for your time.

So the ranker does not work by memorizing your training questions, or by mapping new questions to the closest question in the training data set. In fact, the ranker doesn't directly work with questions at all.
Instead, as per the overview material in the RnR documentation, the ranker uses an approach called 'learning-to-rank' (it might be helpful to take a look through the Wikipedia entry for it: https://en.wikipedia.org/wiki/Learning_to_rank).
Essentially, the learning-to-rank approach is to first generate a bunch of features that capture some notion of how well each of the candidate documents returned from the initial Retrieve phase matches the query. See this post for more info on features: watson retrieve-and-rank - manual ranking.
Then, based on the training data, the ranker will learn how to pay attention to these features in order to best re-rank the set of candidate documents in order to optimize for relevance. This approach allows it to generalize to different questions that come in the future (these might have the same topics, or they might not).
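To make the idea concrete, here is a minimal sketch in Java of the re-ranking step. It assumes, purely for illustration, a linear model over per-candidate feature scores; the Candidate and LinearRanker classes are hypothetical, and the actual Ranker's model and feature set are internal to the service:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Candidate {
    final String docId;
    final double[] features; // e.g. the Retrieve score plus query/document overlap measures

    Candidate(String docId, double[] features) {
        this.docId = docId;
        this.features = features;
    }
}

class LinearRanker {
    private final double[] weights; // learned from the training data

    LinearRanker(double[] weights) {
        this.weights = weights;
    }

    // Score one candidate as a weighted sum of its feature values.
    double score(Candidate c) {
        double s = 0.0;
        for (int i = 0; i < weights.length; i++) {
            s += weights[i] * c.features[i];
        }
        return s;
    }

    // Re-rank the candidates returned by the Retrieve phase, best first.
    List<Candidate> rerank(List<Candidate> candidates) {
        List<Candidate> reranked = new ArrayList<>(candidates);
        reranked.sort(Comparator.comparingDouble(this::score).reversed());
        return reranked;
    }
}

Because a model like this scores (query, document) feature vectors rather than looking up stored queries, a paraphrase such as "I encountered a problem with my mailbox" yields similar feature values and therefore a similar re-ranking; there is no exact-match lookup on the training questions involved.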

Related

Watson Retrieve & Rank methods: rank vs. search and rank

In the Retrieve & Rank service documentation there are two methods for ranking results:
Rank: Returns the top answer and a list of ranked answers with their ranked scores and confidence values (http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/retrieve-and-rank/api/v1/?node#rank)
Search and rank: Returns re-ranked results for your query. The request is similar to the Search Solr standard query parser method (http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/retrieve-and-rank/api/v1/?node#query_ranker)
What are the differences? What does "custom feature" mean in the rank method? When do I need to use the first method, and when the second?
With the first, you're providing a question and a list of answers, and you're asking the service to rank the answers - to sort them in order of relevance based on the feature scores (that you also provide).
With the second, you're providing a question, and you're asking the service to do a Solr search to retrieve answers and then rank them in order of relevance.
The second is the most commonly used method - asking the service to do the search and to sort the responses ('retrieve and rank').
You'd use the first rank-only method if you want to provide the answers to be sorted, rather than use the Solr search to do that. Or if you wanted to modify the feature scores that are used to do the ranking.
There is a good description of that second part (using custom feature scores) here: https://medium.com/machine-learning-with-ibm-watson/developing-with-ibm-watson-retrieve-and-rank-part-3-custom-features-826fe88a5c63
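For reference, a rough sketch of the two request shapes, based on the linked API reference (the cluster, collection, and ranker IDs in braces are placeholders):

Search and rank (one call does the Solr search and the re-ranking):
GET /v1/solr_clusters/{solr_cluster_id}/solr/{collection_name}/fcselect?ranker_id={ranker_id}&q=my+question&wt=json

Rank only (you supply the candidate answers and their feature scores yourself):
POST /v1/rankers/{ranker_id}/rank
(a multipart form upload whose answer_data part contains one row per answer: an answer ID followed by its feature scores)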

Comparing two solr documents

I am trying to compare two documents in Solr (say Doc A and Doc B) based on a common "name" field, using a Solr query. When I query with A.name, I get document B in the results with a relevance score of, say, SCR1. Now if I do it the reverse way, i.e. I query with B.name, I get document A somewhere in the results, but this time the score of B with A is not the same SCR1.
I believe this is happening because the numbers of terms in Doc A.name and Doc B.name are different, so the similarity scores are not the same. Is that the reason for this difference?
Is there any way I can get the same score either way (as described above)?
Is it not possible to compare the scores of any two queries?
Is it possible to do this with the native Lucene APIs?
To answer your question about comparing scores: scores from two different queries must not be compared.
A similar question was posted in the java-users lucene mailing list.
Here's a link to it: Compare scores across queries
An explanation is given there of why one must not do that.
I'm not quite sure I'm clear on the queries you are referring to, but let's say the situation is something like this:
Doc A: Name = "Carlos Fernando Luís Maria Víctor Miguel Rafael Gabriel Gonzaga Xavier Francisco de Assis José Simão de Bragança, Sabóia Bourbon e Saxe-Coburgo-Gotha"
Doc B: Name = "Tomás António Gonzaga"
If you search for "gonzaga", Doc B will be given the higher score, since, while there is one match in each name, Doc B has a much shorter name, with only three terms, and shorter fields are weighed more heavily. This is the LengthNorm referred to in the TFIDFSimilarity documentation.
There are other factors though. If we just chuck each name into the queryparser, and see what comes up, something like:
Query queryA = queryparser.parse(docA.name);
Query queryB = queryparser.parse(docB.name);
Then the queries generated are much different:
name:carlos name:fernando name:luis name:maria name:victor name:miguel name:rafael name:gabriel name:gonzaga name:xavier name:francisco name:de name:assis name:jose name:simao name:de name:braganca name:saboia name:bourbon name:e name:saxe name:coburgo name:gotha
vs
name:tomas name:antonio name:gonzaga
There is a wealth of reasons why these would generate different scores: the lengthNorm discussed above; the coord factor, which boosts results that match more query terms, would very likely come into play; tf, which weighs documents with more matches for a term more heavily; idf, which prefers terms that appear less frequently over the entire index; and so on.
Scores are only relevant to the result set of a query run. A change to the query, or to the state of the index, can lead to different scores, and they are not intended to be comparable. You can use IndexSearcher.explain to understand how a score was calculated.
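For instance, a minimal sketch, reusing the searcher (an IndexSearcher) and queryA names assumed by the snippet above:

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TopDocs;

// Run the query, then ask the searcher to explain the top hit's score.
TopDocs hits = searcher.search(queryA, 10);
// The Explanation breaks the score down into tf, idf, lengthNorm, and so on.
Explanation explanation = searcher.explain(queryA, hits.scoreDocs[0].doc);
System.out.println(explanation);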

Can Solr or ElasticSearch return same results in different orders to different visitors for the same search criteria?

I am developing a Spring-based website and I need to use a search engine to provide "customized" search results. I am considering Solr or Elasticsearch.
Here is what I mean by "customized".
Suppose I have two fields A and B to search against.
Suppose that there are two visitors and I am able to profile them by tracking their activities. Suppose visitor 1 constantly uses or searches for value a (of A), and visitor 2 for value b (of B). Now both visitors search for records that satisfy A=a OR B=b.
Can Solr or Elasticsearch return the results in a different order for visitors 1 and 2? I mean that, for example, results with A=a rank ahead of results matching only B=b for visitor 1, and the opposite for visitor 2?
I understand that I need to pass some signal to a search engine to ask the engine to give more "weight" to one of the fields.
Thanks and Regards.
It looks like you just need to give a different weight to the fields you're querying on depending on the user that's executing the query.
You could, for example, use a multi_match query with Elasticsearch, which allows you to search on multiple fields, giving them different weights as well. Here is an example that makes fieldA more important:
{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "fieldA^3", "fieldB" ]
  }
}
That way the score is influenced by the weights that you put on the query, and if you sort by score (default) you get the results in the expected order. The weights assigned to the fields need some fine-tuning though depending on your documents and the query you execute.
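For visitor 2 you would send the same query with the weights swapped, for example:

{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "fieldA", "fieldB^3" ]
  }
}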

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve, but wouldn't a simple
q=(somequery) OR id:(1 OR 2 OR 4)
be enough?
If you wanted both parts to be boosted on the same scale (I am not sure whether that isn't already the default behaviour of Solr), you would want to use dismax or edismax, and your query would change to something like:
q=(somequery)^10 OR id:(1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
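As a concrete request (standard Solr parameters; the IDs are placeholders), that might look like:

/select?q=(somequery) OR id:(1 OR 2 OR 4)&fl=id,score&rows=20

Adding fl=id,score makes the scores for the pinned documents visible in the response alongside the regular hits.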
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents that match this other query. explainOther only has effect when combined with the also-required debugQuery parameter.
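For example, a request along these lines (the document IDs are placeholders):

/select?q=query_B&rows=10&debugQuery=true&explainOther=id:(1 OR 2 OR 4)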
All that debug/explain information is overkill for the need, but it may be useful, or the code paths that implement it might provide a guide to implementing a hypothetical new, more narrowly focused 'scoreOther' option.
Another option would be to make use of a pseudo-field calculated using the query() function, to report how any set of results scores on some other query or queries. So if, for example, the original document set was the top-N from query_A, and those are exactly the documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. The documents' scores against query_B would then be included in the output (as bscore).
Finally, the result-grouping functionality can be used both to collect the top-N for one query and to collect scores for lesser documents intersecting with other queries, in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and groups that satisfy both query_B and query_A (but again ranked by query_B). This can be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (as of Solr 4.0.0-BETA) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited to certain groups. (There's a comment in the source code suggesting that alternate sorts per group may be envisioned as a future capability.)
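Putting the last two ideas together, a single request might look roughly like this (query_A and query_B stand for your actual queries, and ascore is just an illustrative alias, as bscore was above):

/select?q=query_B&fl=id,score,ascore:query({!dismax v="query_A"})&group=true&group.query=query_B&group.query=query_A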

Apache Solr 4 most functional autosuggest component

Which one of the Solr components is the best?
TermsComponent
works well for us now, but with limitations, i.e.:
- we can't print out the image for the associated document in the same response
SpellCheckComponent
will have the same limitations as TermsComponent
SearchComponent with NGrams
This one seems to be a step in the right direction, but we ran into a few limitations as well:
we'd like to be able to show all documents grouped by doc type and suggest results in the following format:
Platforms
[IMG] XBOX (12)
[IMG] PS2 (9)
Category
Action - Fighting (20)
Action - Military (13)
Publisher
[IMG] Sony (20)
[IMG] Microsoft (13)
Games
[IMG] Halo 2
[IMG] Halo 3
We'd like each suggestion to include the real product name + image + ID + number of matches, sorted by weight.
Which is most likely to produce the best results and minimize load? We've got just under 25K documents.
You should be able to do this with a combination of ngrams, and faceting. You would search against the ngrams to get the documents you want, then use the facet queries to output your results properly.
I wrote a blog post about making auto complete suggestions with Solr. Check it out, it might be useful! I wrote about the following different ways and the related pros and cons:
Facet using facet.prefix parameter
Ngrams
TermsComponent
Suggester
Unfortunately there isn't yet a complete solution ready to go, but the article can help you make the right choice depending on your requirements.
Since you want to show a complex result and not just words, you should consider using NGrams. It is actually the most flexible solution, and you can combine it with faceting as already mentioned in the other answer you got.
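As a rough sketch of that combination, a suggestion request could look like this (the field names name_ngrams, doc_type, and image_url are hypothetical and depend on your schema, where name_ngrams would be an edge-ngram-analyzed copy of the product name):

/select?q=name_ngrams:hal&fl=id,name,image_url&rows=10&facet=true&facet.field=doc_type&facet.mincount=1

The main result list gives the matching products (name, image, ID), while the facet counts give the per-type match counts for the grouped headings.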
