Relevance feedback in Apache Solr - solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the url.stream GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?

Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it to a (dedicated) SearchHadler.
I've already done something similar and it is quite easy if you look a the original implementation of MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend to use your own parameters in your implementation to prevent collisions with other components.

Related

How to perform an exact search in Solr

I implementing Solr search using an API. When I call it using the parameters as, "Chillout Lounge", it returns me the collection which are same/similar to the string "Chillout Lounge".
But when I search for "Chillout Lounge Box", it returns me results which don't have any of these three words.(in the DB there are values which have these 3 values, but they are not returned.)
According to me, Solr uses Fuzzy search, but when it is done it should return me some values, which will have at least one these value.
Or what could be the possible changes I should to my schema.XML, such that is would give me proper values.
First of all - "Fuzzy search" is a feature you'll have to ask for (by using ~ in standard Lucene query syntax).
If you're talking about regular searches, you can use q.op to select which operator to use. q.op=AND will make sure that all the terms match, while q.op=OR will make any document that contain at least one of the terms be returned. As long as you aren't using fq for this, the documents that match more terms should be scored higher (as the score will add up across multiple terms), and thus, be shown higher in the result set.
You can use the debug query feature in the web interface to see scores for each term for a document, and find out why the document was returned at all. If the document doesn't match any terms, it shouldn't be returned, unless you're asking for all documents to be returned.
Be aware that the analyzer chain defined for the field you're searching might affect what's considered a match and not.
You'll have to add a proper example to get a more detailed answer.

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important as the hardware I run the SOLR on is not too powerful and trying to boost on query time returns with OutOfMemory Exception. (Even If I could work around that increasing memory for java I prefer to be on the safe side and implement the index the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
You can check upon DIH Special Commands which has a $docBoost command
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems no $fieldBoost Command.
For you case though, if you are using DefaultSimilarity, shorter fields are boosted higher then longer fields in the Score calculation.
You can surely implement your own Simiarity class with a changed TF (Term Frequency) and LengthNorm Calculation as your needs.

Terms Prevalence in SolR searches

Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use QueryElevationComponent, which uses special XML file in which you place your specific terms for which you want specific doc ids.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query you can define what are the values and how much these fields have weight on the search.
This can be done in many ways:
Setting the boost
The boost can be set by using "^ "
Using plus operator
If you define + operator in your query, if there is a exact result for that filed value it is shown in the result.
For a better understanding of solr, it is best to get familiar with lucene query syntax. Refer to this link to get more info.

Solr: where to store additional information?

I want to provide additional information per each indexed document during index time.
And access this information in the same analyzer during query time to compare it.
So. Theoretically it would be great to write this value into some field present in this document and at query time search this field also.
f.e. I have an animals db. I want to find all documents with 3 words 'dog' inside. (just an example). I can setup for my "animals" field my custom BaseTokenFilterFactory which will produce my custom TokenFilter which will just count all 'dog' words and store this number somewhere. So. Where I can store this value to access it at searching time?
Your example sounds like something which will be better suited to be handled by custom Similarity or a query function in Solr and not as a custom analyzer.
For example if using Solr 4.0 you can use the function termfreq(field,term) to order by the number of times dog appears. or you can use it as a filter like so:
fq={!frange l=3 u=100000}termfreq(animals,"dog")
This will filter all documents whose animals field doesn't have at least 3 occurrences of the word dog.
The advantage of using this method is that you don't affect the scoring of the documents only filter them.
The ability to filter by function exists since Solr 1.4 so even if you are using an earlier version of Solr (>1.4) you can easily write the "termfreq" function query yourself

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)
Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of programming (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.
I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794 amounts to using the B field example text to find documents that are similar in field A, and may be mimicking exactly the kind of query default MLT does on its usual model field.
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.

Resources