Solr- Find "Significant Terms" on Subset of Documents - solr

I'm trying to get "significant terms" for a subset of documents in Solr. This may or may not be the best way, but I'm currently attempting to use Solr's TF-IDF functionality since we have the data stored in Solr and it's lightning fast. I want to restrict the "DF" count to a subset of my documents, through a search or a filter. I tried this, where I'm searching for "apple" in the name field:
http://localhost:8983/solr/techproducts/tvrh?q=name:apple&tv.tf=true&tv.df=true&tv.tf_idf=true&indent=on&wt=json&rows=1000
and that of course, only gives me documents that have "apple" in the name, but my document frequency gives the counts from the entire dataset, which doesn't seem like what I want. I would think Solr can do this, but maybe not. I'm open to suggestions.
Thanks,
Adrian

It is one the works I have in my backlog[1].
What you need is actually the document frequency in your foreground set ( your subset of docs) and the document frequency in your background set(your corpus).
Solr won't do that out of the box, but you can work on it.
Elastic Search has a module for that you can inspiration from[2]
[1] https://issues.apache.org/jira/browse/SOLR-9851
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

Related

Relevance feedback in Apache Solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the url.stream GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?
Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it to a (dedicated) SearchHadler.
I've already done something similar and it is quite easy if you look a the original implementation of MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend to use your own parameters in your implementation to prevent collisions with other components.

How do I override Solr's relevancy in a query

I am integrating a chemical structure search with Solr. To that end I am creating a Solr plugin.
The structure search returns the structure_id and it's score. Scores are values between 100 and 0 (probably would never see a 0)
I use this to create a Solr query to pull all documents that have the structure_ids. I want the results of the search to be ordered by the structure search score, not the Solr relevancy.
I generate a query that looks like this:
+structure_id:(28760263^95 OR 30392284^82 OR 47390042^70)
The problem is that in my trivial test case Solr is returning the records matching the structure_id 28760263 last. It has assigned it the lowest relevancy (4.6609402E-6)!
I wrote a function to basically amplify the score by a lot and that apparently does fix the problem however I don't think that the amplification should be necessary.
I am using Solr 3.5.
Is there some configuration that I am missing? Currently I am using Solr pretty much out of the box. The only things I've changed is to add my plugin and I edited the example docs to add structure_ids for my test case.
Is there a way to completely override the lucene scoring with the score from the structure search? We have other reasons why we would like to take control of Solr's scoring and knowing how to do that would be useful

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try it the multiple requests solution and benchmark it -- solr tends to be a lot faster than what people put in front of it, so an couple few requests might not be that big of a deal.
you could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
if you want the term to be searched over the same fields every time, you have 2 options not breaking the "single query" requirement:
1) copyField: you group at index time all the fields that should match togheter. With just one copyfield your problem doesn't exist, if you need more than one, you're at the same spot.
2) you could filter the query each time dynamically adding the "fq" parameter at the end
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll be searching always on the same two fields (or you can setup them before querying) otherwise you'll always need at least a second query
Since i said "you have 2 options" but you actually have 3 (and i rushed my answer), here's the third:
3) the dismax plugin described by them like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
so, if you can use it, you may want to give it a look and start from the qf parameters (that is what the option number 2 wanted to be about, but i changed it in favor of fq... don't ask me why...)
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+titlle&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json

How to sort by tag considering the tags weights related to every document?

I'm building up a Solr search engine to search on a 300k documents collection. Among the many indexed fields, an important one is tags.
My idea is to assign to every document a vector of tags, each one with a given weight (basically depending on the number of users who chose that tag for that document). For instance
Doc1 = {tag1:0.3, tag2:0.7, tag3:0.8, tag4:1}
Doc2 = {tag2:0.5, tag3:0.8, tag4:0.8, tag5=0.9}
Using this example, when someone ask for documents tagged with tag4, I would give back both the documents of course, but Doc1 with an highest score since it has tag4 weighted higher.
Ideally, the way to implement this on Solr, would be something like creating a multiValued field called "tags", and assign at indexing time a weight to each tag contained in such a field. So, first question:
Is it possible to assign a term frequency (as a tag weigth) manually at indexing time?
To what I found... seems not! Ok... a workaround is to copy for instance tag4 10 times on the tags field of Doc1 and just 8 on the tags field of Doc2. Of course has some drawbacks and limitations.
However here comes the bigger problem I cannot solve even with a workaround. I would like to define my own score. The one that fit better my specific case would be something like sort=tf(tags,tag4). In fact TF is in this case much more important than IDF! Unfortunately this feature (Relevance Functions) will be released just in Solr 4: http://wiki.apache.org/solr/FunctionQuery#tf
Have you got any idea about how to change the scoring function in Solr 3.5 giving more importance to TF and less to IDF?
Is there any hack to do it simply, or would you change the Lucene source code (if yes... what and where?), or would you use the Solr4 night build?
Thanks in advance for your advices!

Solr: where to store additional information?

I want to provide additional information per each indexed document during index time.
And access this information in the same analyzer during query time to compare it.
So. Theoretically it would be great to write this value into some field present in this document and at query time search this field also.
f.e. I have an animals db. I want to find all documents with 3 words 'dog' inside. (just an example). I can setup for my "animals" field my custom BaseTokenFilterFactory which will produce my custom TokenFilter which will just count all 'dog' words and store this number somewhere. So. Where I can store this value to access it at searching time?
Your example sounds like something which will be better suited to be handled by custom Similarity or a query function in Solr and not as a custom analyzer.
For example if using Solr 4.0 you can use the function termfreq(field,term) to order by the number of times dog appears. or you can use it as a filter like so:
fq={!frange l=3 u=100000}termfreq(animals,"dog")
This will filter all documents whose animals field doesn't have at least 3 occurrences of the word dog.
The advantage of using this method is that you don't affect the scoring of the documents only filter them.
The ability to filter by function exists since Solr 1.4 so even if you are using an earlier version of Solr (>1.4) you can easily write the "termfreq" function query yourself

Resources