Which one of the Solr components is the best:
TermsComponent
works well for us now but with limitations, ie:
- we can't print out the image for associated document in the same response
SpellCheckComponent
will have same limitations as TermsComponent
SearchComponent with NGrams
This one seems to be the step in the right direction but ran into a few limitations as well:
we'd like to be able to show all document grouped by doc type and suggest results in the following format:
Platforms
[IMG] XBOX (12)
[IMG] PS2 (9)
Category
Action - Fighting (20)
Action - Military (13)
Publisher
[IMG] Sony (20)
[IMG] Microsoft (13)
Games
[IMG] Halo 2
[IMG] Halo 3
suggest Real Product Name + Image + ID + Number of matches sorted by the weight.
Which is more likely to produce best results and minimize the load? We've got just under 25K documents
You should be able to do this with a combination of ngrams, and faceting. You would search against the ngrams to get the documents you want, then use the facet queries to output your results properly.
I wrote a blog post about making auto complete suggestions with Solr. Check it out, it might be useful! I wrote about the following different ways and the related pros and cons:
Facet using facet.prefix parameter
Ngrams
TermsComponent
Suggester
Unfortunately there isn't yet a complete solution ready to go, but the article can help you making the right choice depending on your requirements.
Since you want to show a complex result and not just words, you should consider using NGrams. It is actually the most flexible solution, and you can combine it with faceting as already mentioned in the other answer you got.
Related
I am struggling with a little problem where I have to display relevant information about the resultset returned from SolR but can't figure out how to calculate it without iterating the results (bad).
Basically I am storing my documents with a state field and while the search is supposed to return all documents, the UI has to show "Found 15 entities, 5 are in state A, 3 in state B and 8 in C".
At the moment I am using a rather brittle approach of running the query 3 times with additional scoping by type, but I'd rather get that information from the one query I am displaying. (There have been some edge cases where the numbers don't add up and since SolR can return facets I guess there has to be a way to use that functionality in this case)
I am using SolR 3.5 from Rails with the sunspot gem
As you mention yourself, you can use facets for this by setting
facet=true&facet.field=state
I'm not familiar with the sunspot gem, but by looking at the documentation you can use
facets like this(Assuming Entity is your searchable):
Entity.search do:
facet :state
end
This should return the states of all entities returned by your query with the number of entities in this state. The Sunspot documentation tells me you can read these facets in the following way:
search.facet(:state).rows.each do |facet|
puts "State #{facet.value} has #{facet.count} entities"
end
Essentially there are three main sets of functions you can use to garner stats from solr.
The first is faceting:
http://wiki.apache.org/solr/SimpleFacetParameters
There is also grouping (field collapsing):
https://wiki.apache.org/solr/FieldCollapsing
And the stats package:
https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
Although the stats, facet and group may be replaced by the analytic package known as olap which is aimed to be in solr V 5.0.0:
https://issues.apache.org/jira/browse/SOLR-5302
Good luck.
I have people indexed into solr based on documents that they have authored. For simplicity's sake, let's say they have three fields - an integer ID, a Text field and a floating point 'SpecialRank' (a value between 0 and 1 to indicate how great the person is). Relevance matching in solr is all done through the Text field. However, I want my final result list to be a combination of relevance to the query as provided by solr and my own SpecialRank. Namely, I need to re-rank the results based on the following formula:
finalScore = (0.8 * solrScore) + (0.2 * SpecialScore)
As far as I'm aware, this is a common task in information retrieval, as we are just combining two different scores in a weighted manner. The trouble is, I need solrScore to be normalized for this to work. What I have been doing is normalizing the solrScore based on the maxScore for a particular query and re-ranking the results client-side. This has been working OK, but means I have to retrieve all the matching documents from solr before I do my re-ranking.
I am looking for the best way to have solr take care of this re-ranking. Are boost functions able to help here? I have read that they can be multiplicative or additive to the solr score, but since the solr score is not normalized and all over the place depending on different queries, this doesn't really seem to solve my problem. Another approach I have tried is to first query solr for a single document just to get the maxScore, and then use the following formula for the sort:
sum(product(0.8,div(score,maxScore)),product(0.2,SpecialRank))+desc
This, of course, doesn't work as you're unable to use the score as a variable in a sort function.
Am I crazy here? Surely this is a common enough task in IR. I've been banging my head against the wall for a while now, any ideas would be much appreciated.
You could try to implement custom SearchComponent that will go trough results on Solr and calculate your custom score there. Get results found from ResponseBuilder (rb.getResults().docSet), iterate trough them, add calculated value to your results and re-sort them.
You can then register your SearchComponent as last in RequestHandler chain:
<arr name="last-components">
<str>elevator</str>
</arr>
More info in SolR manual:
http://wiki.apache.org/solr/SearchComponent
Sorry, but no better idea for now.
I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the url.stream GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?
Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it to a (dedicated) SearchHadler.
I've already done something similar and it is quite easy if you look a the original implementation of MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend to use your own parameters in your implementation to prevent collisions with other components.
Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try it the multiple requests solution and benchmark it -- solr tends to be a lot faster than what people put in front of it, so an couple few requests might not be that big of a deal.
you could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
if you want the term to be searched over the same fields every time, you have 2 options not breaking the "single query" requirement:
1) copyField: you group at index time all the fields that should match togheter. With just one copyfield your problem doesn't exist, if you need more than one, you're at the same spot.
2) you could filter the query each time dynamically adding the "fq" parameter at the end
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll be searching always on the same two fields (or you can setup them before querying) otherwise you'll always need at least a second query
Since i said "you have 2 options" but you actually have 3 (and i rushed my answer), here's the third:
3) the dismax plugin described by them like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
so, if you can use it, you may want to give it a look and start from the qf parameters (that is what the option number 2 wanted to be about, but i changed it in favor of fq... don't ask me why...)
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+titlle&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
I'm building up a Solr search engine to search on a 300k documents collection. Among the many indexed fields, an important one is tags.
My idea is to assign to every document a vector of tags, each one with a given weight (basically depending on the number of users who chose that tag for that document). For instance
Doc1 = {tag1:0.3, tag2:0.7, tag3:0.8, tag4:1}
Doc2 = {tag2:0.5, tag3:0.8, tag4:0.8, tag5=0.9}
Using this example, when someone ask for documents tagged with tag4, I would give back both the documents of course, but Doc1 with an highest score since it has tag4 weighted higher.
Ideally, the way to implement this on Solr, would be something like creating a multiValued field called "tags", and assign at indexing time a weight to each tag contained in such a field. So, first question:
Is it possible to assign a term frequency (as a tag weigth) manually at indexing time?
To what I found... seems not! Ok... a workaround is to copy for instance tag4 10 times on the tags field of Doc1 and just 8 on the tags field of Doc2. Of course has some drawbacks and limitations.
However here comes the bigger problem I cannot solve even with a workaround. I would like to define my own score. The one that fit better my specific case would be something like sort=tf(tags,tag4). In fact TF is in this case much more important than IDF! Unfortunately this feature (Relevance Functions) will be released just in Solr 4: http://wiki.apache.org/solr/FunctionQuery#tf
Have you got any idea about how to change the scoring function in Solr 3.5 giving more importance to TF and less to IDF?
Is there any hack to do it simply, or would you change the Lucene source code (if yes... what and where?), or would you use the Solr4 night build?
Thanks in advance for your advices!