Word proximity not working in Apache Solr


I am using the dismax parser to boost phrase queries, with the following parameters:
qf=story_title^5.0+tax_payer_name+judgement_text^1.0+story_description^1.0+tax_payer_name+nature_of_the_issues+decision_summary+additional_comments+facts_of_the_case+section_number';
pf=story_title^5.0+&pf=judgement_text+story_description^1+nature_of_the_issues+decision_summary+additional_comments+facts_of_the_case+section_number';
qs=3';
ps=3';
But whenever I search for something like 54F beed registration, some results rank highly only because the word registration recurs many times in them, not because they contain 54F beed registration.
I read somewhere that the Solr score depends on how often a word repeats in a document.
How can we override this behavior to get the desired results in Solr?
Thanks in advance.

I don't think there's an omitTermFreq setting yet, even though it has been mentioned many times.
A possible solution is to create your own similarity class by subclassing DefaultSimilarity and returning 1.0f as the tf value, so that repeated occurrences of a term no longer raise the score.
See Solr Custom Similarity for an example of how to implement a custom similarity class. Recent versions of Solr (4.0+) support a custom similarity class per field.
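To illustrate why flattening tf helps, here is a minimal Python sketch of the term-frequency factor on its own. It assumes the classic Lucene formula tf = sqrt(freq) used by DefaultSimilarity and deliberately ignores idf, norms, coord and boosts. The function names and the term counts are made up for illustration and are not part of Solr.

```python
import math

QUERY_TERMS = ("54F", "beed", "registration")

def classic_tf(freq):
    """Classic Lucene term-frequency factor: sqrt(freq)."""
    return math.sqrt(freq)

def flat_tf(freq):
    """Flattened factor: a matching term counts once, however often it repeats."""
    return 1.0 if freq > 0 else 0.0

def tf_only_score(term_counts, tf):
    """Sum the tf factor over the query terms (idf, norms and boosts left out)."""
    return sum(tf(term_counts.get(term, 0)) for term in QUERY_TERMS)

# Hypothetical documents: one holds every query term once,
# the other only repeats "registration" many times.
doc_all_terms = {"54F": 1, "beed": 1, "registration": 1}
doc_repeats = {"registration": 16}

classic = (tf_only_score(doc_all_terms, classic_tf), tf_only_score(doc_repeats, classic_tf))
flat = (tf_only_score(doc_all_terms, flat_tf), tf_only_score(doc_repeats, flat_tf))
```

With sqrt(freq) the repetitive document wins on this factor (4.0 against 3.0); with a constant tf the document containing all three terms wins (3.0 against 1.0).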

Related

Adding Boost to Score According to Payload of Multivalued Field at Solr

Here is my case;
I have a field in my schema named elmo_field, and I want it to hold values with payloads, e.g.:
dorothy|0.46 sesame|0.37 big bird|0.19 bird|0.22
When a user searches for a keyword, e.g. dorothy, I want to add 0.46 to the usual score. If the user searches for big bird, 0.19 should be added, and if the user searches for bird, 0.22 should be added (either the payload itself is added, or the payload multiplied by a normalization coefficient).
In other words, I will search the other fields of my Solr schema as usual. At the same time I will run an exact-match search on elmo_field, and if it matches something I will increase the score by the payload.
Any ideas?

I've implemented a custom similarity wrapper. For usual things I've used DefaultSimilarity. If a field is a payloaded field another similarity that is implemented by me is used. That similarity class just ignores payload value. I've also implemented a query parser that is a customized version of edismax. With that approach I could add payload value into the document score.
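The scoring rule from the question can be sketched outside Solr as follows. This is only an illustration of the arithmetic, not the custom similarity or query parser itself: parse_payloads, boosted_score and the coefficient parameter are hypothetical names, and the parser assumes payload values are plain decimal numbers separated from the next term by whitespace.

```python
import re

# A term (which may contain spaces), a "|" delimiter, then a numeric payload.
PAYLOAD_PATTERN = re.compile(r"(.+?)\|([0-9.]+)(?:\s+|$)")

def parse_payloads(field_value):
    """Turn 'dorothy|0.46 big bird|0.19' into {'dorothy': 0.46, 'big bird': 0.19}."""
    return {m.group(1): float(m.group(2)) for m in PAYLOAD_PATTERN.finditer(field_value)}

def boosted_score(base_score, keyword, payloads, coefficient=1.0):
    """Add the payload of an exactly matching keyword to the usual score."""
    return base_score + coefficient * payloads.get(keyword, 0.0)

payloads = parse_payloads("dorothy|0.46 sesame|0.37 big bird|0.19 bird|0.22")
```

For example, boosted_score(1.0, "big bird", payloads) adds 0.19 to a base score of 1.0, while a keyword that is absent from elmo_field leaves the score unchanged.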

Have you looked at CustomScoreQuery?
There's an example with some explanation how to do this at http://dev.fernandobrito.com/2012/10/building-your-own-lucene-scorer/

You could do a boost on a query as this question suggests: How to assign a weight to a term query in Lucene/Solr
Or you could try using payloads as described here:
http://searchhub.org/2009/08/05/getting-started-with-payloads/

Relevance feedback in Apache Solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the stream.url GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
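The two steps above can be sketched in Python like this. It only builds the request URL and does not contact Solr. The function name more_like_those_url is hypothetical, and the host, handler paths and parameters are the ones quoted above.

```python
from urllib.parse import urlencode

def more_like_those_url(solr_base, doc_ids, field="text"):
    """Build an MLT request whose input is the result of an id:... select query."""
    # Step 1: the select URL that returns the chosen documents.
    select_query = " ".join(f"id:{doc_id}" for doc_id in doc_ids)
    select_url = solr_base + "/select?" + urlencode({"q": select_query})
    # Step 2: pass that URL, percent-encoded, as stream.url to the MLT handler.
    mlt_params = {"mlt.fl": field, "mlt.mintf": 0, "stream.url": select_url}
    return solr_base + "/mlt?" + urlencode(mlt_params)

url = more_like_those_url("http://solrServer:8983/solr", [1, 2, 3])
```

urlencode percent-encodes the whole inner URL, so its own ? and = characters cannot be confused with the outer query string.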
This approach is a poor fit: treating the set of input documents as one big document discriminates against short documents, because a short document makes up only a small portion of the big one and contributes few of its top terms.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?

Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it with a (dedicated) SearchHandler.
I've already done something similar, and it is quite easy if you look at the original implementation of the MLT component.
The most difficult part is synchronizing the results from different shard servers, but that can be skipped if you do not use shards.
I would also strongly recommend using your own parameter names in your implementation to prevent collisions with other components.
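As a rough illustration of the term selection such a component would have to perform, here is a Python sketch that works on plain token lists instead of a Solr index. It assumes a simple tf * idf weight with a classic Lucene-style idf, and it sums the weights when a term is selected from several documents. All names and the sample statistics are hypothetical.

```python
import math
from collections import Counter

def idf(term, doc_freq, num_docs):
    """Classic Lucene-style idf; any smoothed variant would serve for this sketch."""
    return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

def top_terms(tokens, doc_freq, num_docs, n):
    """Return the n highest tf * idf terms of one document with their weights."""
    weights = {t: f * idf(t, doc_freq, num_docs) for t, f in Counter(tokens).items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])

def more_like_those_query(docs, doc_freq, num_docs, n=20):
    """Pick the top n terms per input document and emit one boosted query string."""
    combined = Counter()
    for tokens in docs:
        combined.update(top_terms(tokens, doc_freq, num_docs, n))
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(f"{term}^{weight:.2f}" for term, weight in ranked)

# Hypothetical input: two tokenized documents and index-wide document frequencies.
docs = [["solr", "search", "search", "index"], ["solr", "facet", "facet", "facet"]]
doc_freq = {"solr": 90, "search": 10, "index": 50, "facet": 5}
query = more_like_those_query(docs, doc_freq, num_docs=100, n=2)
```

Because the top n terms are taken per document before merging, a short input document still contributes its own terms instead of being drowned out by a longer one.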

How do I override Solr's relevancy in a query

I am integrating a chemical structure search with Solr. To that end I am creating a Solr plugin.
The structure search returns the structure_id and its score. Scores are values between 100 and 0 (we would probably never see a 0).
I use this to create a Solr query to pull all documents that have the structure_ids. I want the results of the search to be ordered by the structure search score, not the Solr relevancy.
I generate a query that looks like this:
+structure_id:(28760263^95 OR 30392284^82 OR 47390042^70)
The problem is that in my trivial test case Solr is returning the records matching the structure_id 28760263 last. It has assigned it the lowest relevancy (4.6609402E-6)!
I wrote a function that amplifies the score by a large factor, and that apparently fixes the problem, but I don't think the amplification should be necessary.
I am using Solr 3.5.
Is there some configuration that I am missing? Currently I am using Solr pretty much out of the box. The only changes I've made are adding my plugin and editing the example docs to add structure_ids for my test case.
Is there a way to completely override the Lucene scoring with the score from the structure search? We have other reasons why we would like to take control of Solr's scoring, and knowing how to do that would be useful.
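For reference, the query above can be generated from the structure search hits as sketched below. The second function is a hypothetical client-side fallback that re-sorts the returned documents by the structure score. It is not a Solr feature, it assumes each returned document is a dict with a structure_id key, and it only works when all matching documents are fetched in one page.

```python
def build_structure_query(hits):
    """Build '+structure_id:(id^score OR ...)' from (structure_id, score) pairs."""
    clauses = " OR ".join(f"{structure_id}^{score}" for structure_id, score in hits)
    return f"+structure_id:({clauses})"

def reorder_by_structure_score(docs, hits):
    """Sort returned documents by the external structure score, highest first."""
    score_by_id = dict(hits)
    return sorted(docs, key=lambda doc: score_by_id.get(doc["structure_id"], 0), reverse=True)

hits = [(28760263, 95), (30392284, 82), (47390042, 70)]
query = build_structure_query(hits)

# Documents in the order Solr happened to return them in the test case.
returned = [{"structure_id": 30392284}, {"structure_id": 47390042}, {"structure_id": 28760263}]
ordered = reorder_by_structure_score(returned, hits)
```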

Developing custom facet calculations in SOLR

I'm looking into using Solr for a project where we have some specific faceting requirements. From what I've learned, Solr provides range-based facets over value ranges or date ranges, i.e. field values are "grouped" and aggregated into different bins.
I would like to do something similar, but I want to create a custom function that maps field values to my specific facets, so that each field value is evaluated using a function to see which facet it belongs to.
myFacet = myFacetMapper(fieldValue)
It's sort of a more advanced version of range facets, where values are mapped using a custom function rather than just sorted into different bins.
Does anyone know if this is possible and where to start?

I would look into using SimpleFacets to implement your logic. Then you embed it inside a SearchComponent, that you can register into your solrconfig. Look at the code of FacetComponent for an example.

Create another field with value = myFacetMapper(field) at index time, then do normal faceting on that field.
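A minimal Python sketch of that idea, run on plain dicts before they are sent to Solr. The mapping rule, the field names value and my_facet, and the helper add_facet_field are all hypothetical, and the Counter at the end only imitates what ordinary field faceting would return.

```python
from collections import Counter

def my_facet_mapper(field_value):
    """Hypothetical mapping that plain range facets could not express."""
    if field_value < 0:
        return "negative"
    return "even" if field_value % 2 == 0 else "odd"

def add_facet_field(doc, source_field="value", facet_field="my_facet"):
    """Return a copy of the document with the precomputed facet field added."""
    enriched = dict(doc)
    enriched[facet_field] = my_facet_mapper(doc[source_field])
    return enriched

docs = [{"id": 1, "value": 4}, {"id": 2, "value": 7}, {"id": 3, "value": -2}, {"id": 4, "value": 10}]
indexed = [add_facet_field(doc) for doc in docs]

# Stand-in for faceting on my_facet: count documents per facet value.
facet_counts = Counter(doc["my_facet"] for doc in indexed)
```

The trade-off is that the mapping is fixed at index time, so changing the function means reindexing.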

Use different Solr Similarity algo for every search

Is it possible in Solr 1.4 to specify which similarity class to use for each search within a single index?
Let's say I have two types of search (keyword and brand). For keyword search, I want to use the DefaultSimilarity class. But for brand search, I want to use my CustomSimilarity class.
So far I've been modifying schema.xml to specify a single similarity class. But I now have a requirement to use two different similarity classes.
I'll be glad to hear your thoughts on this.
Thanks in advance.

AFAIK the Similarity can only be defined at the schema/index level and can't be overridden per fieldType or per query (see this and this).
However you can customize your result ordering using other methods: boosting, function queries, a custom analyzer per field, or even sorting.
The Solr Relevancy Cookbook wiki is a good reference.
