How do I override Solr's relevancy in a query?

I am integrating a chemical structure search with Solr. To that end I am creating a Solr plugin.
The structure search returns the structure_id and its score. Scores are values between 100 and 0 (we would probably never see a 0).
I use this to create a Solr query to pull all documents that have the structure_ids. I want the results of the search to be ordered by the structure search score, not the Solr relevancy.
I generate a query that looks like this:
+structure_id:(28760263^95 OR 30392284^82 OR 47390042^70)
The problem is that in my trivial test case Solr is returning the records matching the structure_id 28760263 last. It has assigned it the lowest relevancy (4.6609402E-6)!
I wrote a function that amplifies the score by a lot, and that apparently does fix the problem. However, I don't think the amplification should be necessary.
I am using Solr 3.5.
Is there some configuration that I am missing? Currently I am using Solr pretty much out of the box. The only things I've changed are adding my plugin and editing the example docs to add structure_ids for my test case.
Is there a way to completely override the Lucene scoring with the score from the structure search? We have other reasons why we would like to take control of Solr's scoring, and knowing how to do that would be useful.
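Until the Solr-side scoring is sorted out, one possible stopgap is to re-order the returned documents on the client using the structure search scores directly. The sketch below is only an illustration and does not change anything inside Solr: the `build_query` and `rerank` helpers and the sample documents are hypothetical, while the ids and scores are the ones from the question.

```python
# Scores from the structure search, keyed by structure_id (values from the question).
structure_scores = {"28760263": 95, "30392284": 82, "47390042": 70}


def build_query(scores):
    """Build the boosted OR query from a {structure_id: score} map."""
    clauses = " OR ".join(f"{sid}^{score}" for sid, score in scores.items())
    return f"+structure_id:({clauses})"


def rerank(docs, scores):
    """Order documents by their structure search score, highest first."""
    return sorted(docs, key=lambda d: scores[str(d["structure_id"])], reverse=True)


# Hypothetical documents, in an order that ignores the structure scores.
docs = [
    {"id": "b", "structure_id": "30392284"},
    {"id": "c", "structure_id": "47390042"},
    {"id": "a", "structure_id": "28760263"},
]

print(build_query(structure_scores))
print([d["id"] for d in rerank(docs, structure_scores)])
```

This only works when the whole result set is fetched in one request, since paging happens in Solr before the client-side sort.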

Related

Solr- Find "Significant Terms" on Subset of Documents

I'm trying to get "significant terms" for a subset of documents in Solr. This may or may not be the best way, but I'm currently attempting to use Solr's TF-IDF functionality since we have the data stored in Solr and it's lightning fast. I want to restrict the "DF" count to a subset of my documents, through a search or a filter. I tried this, where I'm searching for "apple" in the name field:
http://localhost:8983/solr/techproducts/tvrh?q=name:apple&tv.tf=true&tv.df=true&tv.tf_idf=true&indent=on&wt=json&rows=1000
and that of course, only gives me documents that have "apple" in the name, but my document frequency gives the counts from the entire dataset, which doesn't seem like what I want. I would think Solr can do this, but maybe not. I'm open to suggestions.
Thanks,
Adrian
It is one of the items I have in my backlog [1].
What you need is actually the document frequency in your foreground set (your subset of docs) and the document frequency in your background set (your corpus).
Solr won't do that out of the box, but you can work on it.
Elasticsearch has a module for that which you can take inspiration from [2].
[1] https://issues.apache.org/jira/browse/SOLR-9851
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
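To make the foreground/background idea concrete, here is a minimal sketch done outside Solr on a toy corpus. The scoring used (foreground document rate divided by background document rate) is a simple assumption chosen for illustration, not the heuristic Elasticsearch uses, and the function names and sample documents are hypothetical. It assumes the foreground set is a subset of the background set.

```python
from collections import Counter


def doc_freq(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df


def significant_terms(foreground, background, min_fg_df=2):
    """Rank terms by how much more common they are in the foreground set."""
    fg_df, bg_df = doc_freq(foreground), doc_freq(background)
    scored = []
    for term, fg in fg_df.items():
        if fg < min_fg_df:
            continue  # skip terms too rare in the subset to be meaningful
        fg_rate = fg / len(foreground)
        bg_rate = bg_df[term] / len(background)  # > 0 because foreground is a subset
        scored.append((term, fg_rate / bg_rate))
    return sorted(scored, key=lambda t: (-t[1], t[0]))


corpus = [
    "apple ipod nano",
    "apple ipod video",
    "apple power adapter",
    "belkin ipod cable",
    "samsung hard drive",
    "maxtor hard drive",
    "corsair memory module",
]
subset = [d for d in corpus if "apple" in d.split()]  # stands in for q=name:apple

for term, score in significant_terms(subset, corpus):
    print(term, round(score, 2))
```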

word proximity not working in apache solr

I am using dismax parser to boost phrase queries like following
qf=story_title^5.0+tax_payer_name+judgement_text^1.0+story_description^1.0+tax_payer_name+nature_of_the_issues+decision_summary+additional_comments+facts_of_the_case+section_number';
pf=story_title^5.0+&pf=judgement_text+story_description^1+nature_of_the_issues+decision_summary+additional_comments+facts_of_the_case+section_number';
qs=3';
ps=3';
but whenever I search for something like 54F beed registration, some results come up in which the word registration recurs many times, rather than results that contain 54F beed registration.
Somewhere I read that the Solr score depends on how often a word is repeated in a document.
How can we override this behavior to achieve the desired results in Solr?
Thanks in advance.
I don't think there's an omitTermFreq setting yet, even if it has been mentioned many times.
A possible solution is to create your own similarity class by subclassing DefaultSimilarity, and returning 1.0f as the tf value.
See Solr Custom Similarity for an example of how to implement a custom similarity class. Recent versions of Solr (4.0+) support a custom similarity class per field.
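The effect of returning a constant tf can be illustrated with a toy scorer. This is only a sketch of the idea, not Lucene code: it ignores idf, length norms and coord, and the `score` function and sample documents are hypothetical. The one grounded piece is that the classic DefaultSimilarity computes tf as the square root of the term frequency.

```python
import math


def default_tf(freq):
    """Classic DefaultSimilarity tf: square root of the term frequency."""
    return math.sqrt(freq)


def flat_tf(freq):
    """The suggested override: a matching term counts as 1.0 however often it repeats."""
    return 1.0 if freq > 0 else 0.0


def score(doc_tokens, query_terms, tf):
    """Toy score: sum of tf over the query terms (no idf, norms or coord)."""
    return sum(tf(doc_tokens.count(term)) for term in query_terms)


query = "54f beed registration".split()
doc_a = "54f beed registration".split()  # contains every query term once
doc_b = ["registration"] * 16            # repeats a single query term

for name, tf in (("default", default_tf), ("flat", flat_tf)):
    print(name, score(doc_a, query, tf), score(doc_b, query, tf))
```

With the square-root tf the repetitive document outscores the one that matches all three terms; with the flat tf the order is reversed.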

Relevance feedback in Apache Solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the stream.url GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?
Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it with a (dedicated) SearchHandler.
I've already done something similar, and it is quite easy if you look at the original implementation of the MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend using your own parameter names in your implementation to prevent collisions with other components.
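The per-document term extraction the question asks for can be sketched outside Solr as follows. This is a toy illustration under stated assumptions, not the MLT component's implementation: the smoothed idf formula, the choice to keep the highest weight when a term is selected from several documents, and the names `top_terms` and `more_like_those` are all hypothetical.

```python
import math
from collections import Counter


def top_terms(doc, corpus, n):
    """Return the n highest TF-IDF terms of one document as (term, weight) pairs."""
    tf = Counter(doc.split())
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for d in corpus if term in d.split())
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed idf (assumption)
        weights[term] = freq * idf
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def more_like_those(input_docs, corpus, n=2, field="text"):
    """Build one boosted query from the top n terms of each input document."""
    boosts = {}
    for doc in input_docs:
        for term, weight in top_terms(doc, corpus, n):
            # A term picked from several documents keeps its highest weight.
            boosts[term] = max(boosts.get(term, 0.0), weight)
    return " ".join(f"{field}:{t}^{w:.2f}" for t, w in sorted(boosts.items()))


corpus = [
    "solr search engine search",
    "lucene index",
    "solr index tuning guide",
    "cooking pasta recipe",
]
print(more_like_those(corpus[:2], corpus, n=1))
```

Because every input document contributes its own top n terms, a short document is represented as strongly as a long one, which addresses the objection to concatenating the inputs.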

Solr get calculated distance while using dismax

I'm starting to think that what I want to do is not possible but thought I would give this a try.
I'm running Solr 3.5.
I currently have two types of search:
A basic spatial query which returns the calculated distance between two points in the score field.
Sample Query from my Solr logs:
?fl=*,score&sort=score+asc&start=0&q={!func}geodist()&sfield=coordinates&pt=59.2363514,18.092783&version=2
A dismax query which allows free text queries on a number of fields.
Sample Query from Solr log:
mm=1&d=100.0&sfield=coordinates&qf=field1^5.0+fields2^3.0&defType=edismax&version=2&fl=*,score&start=1&q=monkeyhopper&pt=59.2363514,18.0927830000&fq={!geofilt}}
I want to replace my first query with the dismax query but I really need to get the calculated distance in the response. Yes, I can calculate the distance programmatically but I would prefer not having to do this as Solr has done it for me already.
I still want to be able to sort my dismax query "by relevance", distance or any other field so the score given by my boosts could be interesting for sorting but I don't need it to be returned.
If I understood correctly, you want to have the result of a function in your Solr response. I guess SOLR-2444 is what you're looking for: it allows pseudo-fields, functions, etc. to be included in the fl parameter. The only problem is that it has been committed only to trunk, so it isn't available in the current Solr release, nor will it be in the coming 3.6 release. You will have to wait for the Solr 4 release, but I don't think that will take long. Maybe you can already start playing around with a snapshot of the last successful Jenkins build.
Pseudo-fields are now available in Solr 4+ which allow you to do just this.
http://localhost:8983/solr/collection1/browse?q=*:*&rows=1000&wt=xml&pt=37.763649,-122.24313&sfield=store&fl=dist:geodist()
For instance, this request allows me to return a field "dist" which contains the distance of each entry to the stated point.
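On Solr 3.5, where pseudo-fields are not available, the client-side fallback mentioned in the question can be sketched with the haversine formula. This is an illustration only: the Earth radius constant is an assumed mean value, so results should be close to, but not necessarily identical to, what geodist() returns, and the sample document and its coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# The pt value from the question, plus a hypothetical document returned by dismax.
pt = (59.2363514, 18.092783)
doc = {"id": "1", "coordinates": "59.3293,18.0686"}

lat, lon = (float(x) for x in doc["coordinates"].split(","))
doc["dist"] = haversine_km(pt[0], pt[1], lat, lon)
print(doc["id"], round(doc["dist"], 2), "km")
```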

How to sort by tag considering the tags weights related to every document?

I'm building up a Solr search engine to search on a 300k documents collection. Among the many indexed fields, an important one is tags.
My idea is to assign to every document a vector of tags, each one with a given weight (basically depending on the number of users who chose that tag for that document). For instance
Doc1 = {tag1:0.3, tag2:0.7, tag3:0.8, tag4:1}
Doc2 = {tag2:0.5, tag3:0.8, tag4:0.8, tag5:0.9}
Using this example, when someone asks for documents tagged with tag4, I would give back both documents of course, but Doc1 with a higher score since it has tag4 weighted higher.
Ideally, the way to implement this in Solr would be something like creating a multiValued field called "tags" and assigning, at indexing time, a weight to each tag contained in that field. So, first question:
Is it possible to assign a term frequency (as a tag weight) manually at indexing time?
From what I have found, it seems not. A workaround is to copy, for instance, tag4 10 times into the tags field of Doc1 and just 8 times into the tags field of Doc2. Of course, this has some drawbacks and limitations.
However, here comes the bigger problem, which I cannot solve even with a workaround. I would like to define my own score. The one that best fits my specific case would be something like sort=tf(tags,tag4). In fact, TF is in this case much more important than IDF! Unfortunately this feature (Relevance Functions) will only be released in Solr 4: http://wiki.apache.org/solr/FunctionQuery#tf
Have you got any idea about how to change the scoring function in Solr 3.5 giving more importance to TF and less to IDF?
Is there any hack to do it simply, or would you change the Lucene source code (if yes, what and where?), or would you use the Solr 4 nightly build?
Thanks in advance for your advice!
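The repetition workaround described in the question can be sketched as a small indexing helper. This is only an illustration of that workaround, with its stated drawbacks: the `expand_tags` name and the `scale` factor of 10 are assumptions, and weights are rounded to the nearest whole number of copies.

```python
def expand_tags(weights, scale=10):
    """Repeat each tag round(weight * scale) times so term frequency mirrors the weight."""
    values = []
    for tag, weight in weights.items():
        values.extend([tag] * round(weight * scale))
    return values


doc1 = {"tag1": 0.3, "tag2": 0.7, "tag3": 0.8, "tag4": 1}
doc2 = {"tag2": 0.5, "tag3": 0.8, "tag4": 0.8, "tag5": 0.9}

# Values that would be sent for the multiValued "tags" field of each document.
tags_doc1 = expand_tags(doc1)
tags_doc2 = expand_tags(doc2)

print(tags_doc1.count("tag4"), tags_doc2.count("tag4"))
```

The resolution of the weights is limited by `scale`: a larger value preserves finer differences but makes the field proportionally longer.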
