Solr - How do I get the number of documents for each field containing the search term within that field in Solr? - solr

Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.

Solr just plain Doesn't Do This. If you absolutely need it, I'd try it the multiple requests solution and benchmark it -- solr tends to be a lot faster than what people put in front of it, so an couple few requests might not be that big of a deal.

you could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.

if you want the term to be searched over the same fields every time, you have 2 options not breaking the "single query" requirement:
1) copyField: you group at index time all the fields that should match togheter. With just one copyfield your problem doesn't exist, if you need more than one, you're at the same spot.
2) you could filter the query each time dynamically adding the "fq" parameter at the end
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll be searching always on the same two fields (or you can setup them before querying) otherwise you'll always need at least a second query
Since i said "you have 2 options" but you actually have 3 (and i rushed my answer), here's the third:
3) the dismax plugin described by them like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
so, if you can use it, you may want to give it a look and start from the qf parameters (that is what the option number 2 wanted to be about, but i changed it in favor of fq... don't ask me why...)

SolrFaceting should solve your problem.
Have a look at the Examples.

This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+titlle&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json

Related

Is there a way to execute a query on SOLR where I have a list of words that need to be in different fields?

everybody. I'm trying to elaborate a query that complies with the following:
Find a set of words that appear in a group of fields. For example, i want to find the documents that have the words soccer, ball and goalkeeper in one or both fields: 'sport_name' and 'descritpion'.
The problem I'm having is that I need to treat both fields as only one for getting results like:
{
"sport_name":"soccer",
"description": "...played with a ball... positions are goalkeeper"
}
I need that the words appear in any field, but all the words need to appear in the "concatenated bigger field".
Is there a way to do this during query time?
Thanks!!
You can do this by using the edismax handler (defType=edismax), setting q.op=AND (since all the terms has to be present) and using qf=sport_name description to tell Solr to search for the given terms in both fields.
You can also use qf=sport_name^2 description to say that you want to weigh hits in the sport_name field twice as much as hits in the description field. So if there was a sport named something with ball, that hit would contribute more to the score than if the same content were present in the description field.

Solr edismax. How to retrieve the fields that gave the query result

I'm querying mulitple fields using the qf param. But I'm wondering how I can retrieve the field that gave the result.
Example(not a real example):
q={!edismax qf='tag content'}("tablet")AND("pc")
Field values:
doc1:
tag: tablet
content: The test has failled. Use a pc instead.
doc2:
tag: tablet pc
content: The test has worked.
As a result both documents will be returned because they both have the tablet and pc in their tag/content. Is it possible to know that doc2 had both hits in tag and doc1 only had 1 hit in tag and 1 in content? Debugquery doesn't seem to provide information about this.
I know I can increase the importance of a searchfield using the qf boost.
you can either:
use Highlighting
ask for debug info with debugQuery=true and parse the scoring info to find out.
I think 1 is easier, but it imposes some constraints on your fields (they must be stored for example)
Ok based on your response to my question.
Unfortunately, there is no way Solr currently returns which fields matched the query as part of your defaulting result docs . If the query is simple, looping over the returned stored fields is probably your best bet. Highlighting is be an option too.
There are couple of other options suggested here : http://grokbase.com/t/lucene/solr-user/117nkf36nq/determine-which-field-term-was-found

Searching wiki URLs using Solr

I am trying to index and search a wiki on our intranet using Solr. I have it more-or-less working using edismax but I'm having trouble getting main topic pages to show up first in the search results. For example, suppose I have some URLs in the database:
http://whizbang.com/wiki/Foo/Bar
http://whizbang.com/wiki/Foo/Bar/One
http://whizbang.com/wiki/Foo/Bar/Two
http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one
I would like to be able to search for "foo bar" and have the first link returned as the top result because it is the main page for that particular topic in the wiki. I've tried boosting the title and URL field in the search but the fieldNorm value for the document keeps affecting the scores such that sub-pages score higher. In one particular case, the main topic page shows up on the 2nd results page.
Is there a way to make the First URL score significantly higher than the sub categories so that it shows up in the top-5 search results?
One possible approach to try:
Create a copyField with your url
Extract path only (so, no host, no wiki)
Split on / and maybe space
Lowercase
Boost on phrase or bigram or something similar.
If you have a lot of levels, maybe you want a multivalued field, with different depth (starting from the end) getting separate entries. That way a perfect match will get better value. Here, you should start experimenting with your real searches.

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)
Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of programming (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.
I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794 amounts to using the B field example text to find documents that are similar in field A, and may be mimicking exactly the kind of query default MLT does on its usual model field.
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.

Solr Index appears to be valid - but returns no results

Solr newbie here.
I have created a Solr index and write a whole bunch of docs into it. I can see
from the Solr admin page that the docs exist and the schema is fine as well.
But when I perform a search using a test keyword I do not get any results back.
On entering * : *
into the query (in Solr admin page) I get all the results.
However, when I enter any other query (e.g. a term or phrase) I get no results.
I have verified that the field being queried is Indexed and contains the values I am searching for.
So I am confused what I am doing wrong.
Probably you don't have a <defaultSearchField> correctly set up. See this question.
Another possibility: your field is of type string instead of text. String fields, in contrast to text fields, are not analyzed, but stored and indexed verbatim.
I had the same issue with a new setup of Solr 8. The accepted answer is not valid anymore, because the <defaultSearchField> configuration will be deprecated.
As I found no answer to why Solr does not return results from any fields despite being indexed, I consulted the query documentation. What I found is the DisMax query parser:
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In contrast, the default Lucene parser only speaks about searching one field. So I gave DisMax a try and it worked very well!
Query example:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
You can also specify which fields to search exactly to prevent unwanted side effects. Multiple fields are separated by spaces which translate to + in URLs:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features+text
Last but not least, give the fields a weight:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3
If you are using pysolr like I do, you can add those parameters to your search request like this:
results = solr.search('search term', **{
'defType': 'dismax',
'qf': 'features text'
})
In my case the problem was the format of the query. It seems that my setup, by default, was looking and an exact match to the entire value of the field. So, in order to get results if I was searching for the sit I had to query *sit*, i.e. use wildcards to get the expected result.
With solr 4, I had to solve this as per Mauricio's answer by defining type="text_en" to the field.
With solr 6, use text_general.

Resources