Solr query string not working for full text searches - solr

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
However, when I replace the q argument with, say, electronics, I expect 14 results but get nothing.
When I replace the query string q with cat:electronics, I do get the 14 results. Why is this the case? Isn't q=word supposed to search for word wherever it appears?

No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser. This query parser supports the qf parameter that tells Solr which fields to search, instead of having to use the cat:electronics syntax. Your example would then be defType=edismax&q=electronics&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
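As a rough sketch, a request like this could be assembled in Python with only the standard library. The host and port are assumptions about a local tutorial instance, and build_select_url is a hypothetical helper name, not part of any Solr client.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, fields):
    """Build an edismax /select URL.

    `fields` maps a field name to its boost, or to None for no boost.
    """
    qf = " ".join(
        name if boost is None else f"{name}^{boost}"
        for name, boost in fields.items()
    )
    params = {"defType": "edismax", "q": query, "qf": qf}
    # urlencode takes care of escaping spaces and the ^ character.
    return f"{base_url}/select?{urlencode(params)}"

url = build_select_url(
    "http://localhost:8983/solr/gettingstarted",  # assumed local instance
    "electronics",
    {"name": 10, "cat": None},
)
print(url)
```

Keeping the boosts in a mapping makes it easy to tune weights without editing the query string by hand.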

Related

No matches when mixing keywords

I am trying to do a product search setup using Solr. It does return results for keywords that follow the same order in the product name. However, when the keywords are mixed up, no results are returned. I would like to get results with scores that closely match the given keywords in any order.
My question on scoring has the schema, data configuration and query. Any help will be greatly appreciated.
As long as you enter your query as a regular query, instead of using wildcards, any hits in a text_general field as you've defined should be returned.
You can use the mm (minimum should match) parameter to adjust how many of the supplied terms must match. I suggest using the edismax query parser, as that allows you to do more "natural" queries instead of having to add the field names in the query itself:
defType=edismax&qf=catchall&q=nikon dslr
defType=edismax&qf=catchall&q=dslr nikon
should both give the same set of documents (but possibly different scores when using phrase boosts).
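A minimal sketch of how the two requests above might be generated with an explicit mm value. The catchall field comes from the example; the value "100%" (every term must match) and the helper name edismax_query_string are illustrative assumptions.

```python
from urllib.parse import urlencode

def edismax_query_string(user_input, field, min_match):
    """Return the query-string part of an edismax request with an mm setting."""
    return urlencode({
        "defType": "edismax",
        "qf": field,
        "q": user_input,
        "mm": min_match,
    })

# The two word orders differ only in the q parameter.
first = edismax_query_string("nikon dslr", "catchall", "100%")
second = edismax_query_string("dslr nikon", "catchall", "100%")
print(first)
print(second)
```

Lowering mm (for example to "50%") would let documents match when only some of the terms are present.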

What fieldtype to choose and how to look my query

The problem is this: I've got a column (named name) which consists of names, for example "Иван Кирилов Петров", "Нина Семова Мариножа" and so on.
So I want to make a query which will get all the names that have first name 'Иван' and last name 'Петров'. The second name doesn't matter, so I will put a * wildcard character there.
There is also a bigger problem: if the user writes "Иван Кирилов Петров", I should be able to find this exact person.
what I have tried :
I made the field text_ws type
and tested the following queries:
q=name:Иван*Петров
perfect - it finds what I want - all the names with first Иван and last Петров;
But when I search for Иван Кирилов Петров I get no results, because that is an exact search and for that my field type should be string.
How can I solve this?
Try adding the autoGeneratePhraseQueries="true" attribute to your text_ws type definition, and use the debugQuery=true parameter to see how the query is matched against the field. If the basics work, you can then look at the pf3 parameter in the eDismax configuration to boost matches.
Solr also comes with dedicated Token Filters for Russian, but you probably don't care about that for the people's names.
I don't think you need a wild-card query. If you are only splitting on white-space during index time (text_ws) and you get complete first, last and/or middle names for query, you can do an AND query like
q=name:(Иван AND Петров)
or
q=name:(ИВАН AND МИНЧЕВ AND ПЕТРОВ)
Update: After your comment, I see that this will do a bag-of-words search and won't preserve the order. I guess you need to keep a string copy field of name, say name_str, which will give you more search options. For example, if there are 2 spaces in the query, meaning you get the first, middle and last names, then you can do an exact match on name_str like
q=name_str:"ИВАН%20МИНЧЕВ%20ПЕТРОВ"
If you are using Solr 4.0 or above, a regex query on the string field can help. For example,
q=name_str:/ИВАН.*ПЕТРОВ/
will match anything that begins with ИВАН and ends with ПЕТРОВ.
or even
q=name_str:/Иван.*?Кирилов.*?Петров/
Unfortunately, there is no Solr wiki page on regex search yet, but you can google around.
You need to distinguish between the different types of queries you want to do and do different searches. Maybe give a check-box to your users asking if they want an exact match or not.
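One way to sketch that distinction in client code is shown below. It assumes the two fields discussed above, name (text_ws) and a string copy called name_str; the function name build_name_query and the exact flag are hypothetical.

```python
def build_name_query(user_input, exact=False):
    """Return a Solr q value for a person-name search.

    exact=True  -> phrase match on the untokenized name_str copy
    exact=False -> all tokens must appear in name, in any order
    """
    tokens = user_input.split()
    if not tokens:
        raise ValueError("empty name")
    if exact:
        # Escape backslashes and quotes so the phrase stays well-formed.
        phrase = " ".join(tokens).replace("\\", "\\\\").replace('"', '\\"')
        return f'name_str:"{phrase}"'
    # Note: tokens are not escaped here; real input may need that.
    return "name:(" + " AND ".join(tokens) + ")"

print(build_name_query("Иван Петров"))
print(build_name_query("Иван Кирилов Петров", exact=True))
```

The exact flag could be driven by the check-box mentioned above.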

Solr Index appears to be valid - but returns no results

Solr newbie here.
I have created a Solr index and written a whole bunch of docs into it. I can see from the Solr admin page that the docs exist and the schema is fine as well.
But when I perform a search using a test keyword I do not get any results back.
On entering *:* into the query (in the Solr admin page) I get all the results.
However, when I enter any other query (e.g. a term or phrase) I get no results.
I have verified that the field being queried is Indexed and contains the values I am searching for.
So I am confused what I am doing wrong.
Probably you don't have a <defaultSearchField> correctly set up. See this question.
Another possibility: your field is of type string instead of text. String fields, in contrast to text fields, are not analyzed, but stored and indexed verbatim.
I had the same issue with a new setup of Solr 8. The accepted answer is not valid anymore, because the <defaultSearchField> configuration has been deprecated in favor of the df request parameter.
As I found no answer to why Solr does not return results from any fields despite being indexed, I consulted the query documentation. What I found is the DisMax query parser:
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In contrast, the default Lucene parser only speaks about searching one field. So I gave DisMax a try and it worked very well!
Query example:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
You can also specify which fields to search exactly to prevent unwanted side effects. Multiple fields are separated by spaces which translate to + in URLs:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features+text
Last but not least, give the fields a weight:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3
If you are using pysolr like I do, you can add those parameters to your search request like this:
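import pysolr

# Assumed core URL; adjust to your own setup.
solr = pysolr.Solr('http://localhost:8983/solr/techproducts')
results = solr.search('search term', **{
    'defType': 'dismax',
    'qf': 'features text',
})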
In my case the problem was the format of the query. It seems that my setup, by default, was looking for an exact match on the entire value of the field. So, to get results when searching for sit, I had to query *sit*, i.e. use wildcards to get the expected result.
With Solr 4, I had to solve this as per Mauricio's answer by setting type="text_en" on the field.
With Solr 6, use text_general.

How do I return only a truncated portion of a field in SOLR?

I have a really large (5000+ characters) text field in SOLR named Description. So far it works great for searching and highlighting. If I perform a search and there are no highlighted portions then I just show the first 300 characters. What I would like to do is just return the 300 characters in the result from SOLR.
I would like to do this because when testing I get improved performance if I return a smaller result. This is probably because the XML doc is smaller so less time on the wire and then the processing is faster because the doc is smaller.
I have thought of using a new field that just stored the first 300 characters. I think this would work, but I was wondering if there was a better or more native solution.
What you're looking for is the highlighting parameter hl.maxAlternateFieldLength (http://wiki.apache.org/solr/HighlightingParameters#hl.maxAlternateFieldLength).
You will need to define the field as its own alternate field. If you want to highlight the field Description, the highlight query parameters would be:
hl=true
hl.fl=Description
f.Description.hl.alternateField=Description
hl.maxAlternateFieldLength=300
Finally, to omit the Description field from the query result, you will have to exclude it from the fl query parameter:
fl=score,url,title,date,othermetadata
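The parameters above can be collected into a single mapping, sketched here in Python. The helper name snippet_params is hypothetical; the parameter names and the field list are the ones from this answer.

```python
def snippet_params(field, max_chars, return_fields):
    """Collect highlighting parameters that fall back to a truncated field value."""
    if field in return_fields:
        raise ValueError("leave the large field out of fl to keep responses small")
    return {
        "hl": "true",
        "hl.fl": field,
        # Use the field as its own alternate when no snippet is produced.
        f"f.{field}.hl.alternateField": field,
        "hl.maxAlternateFieldLength": str(max_chars),
        "fl": ",".join(return_fields),
    }

params = snippet_params(
    "Description", 300, ["score", "url", "title", "date", "othermetadata"]
)
for key, value in params.items():
    print(f"{key}={value}")
```

A mapping like this can be merged into the keyword arguments of whatever client sends the request.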
When using the Unified Highlighter, hl.alternateField is not available as a query parameter. Instead you can use the hl.defaultSummary query parameter (available since Solr 4.5)
hl.defaultSummary
If true, use the leading portion of the text as a snippet if a proper highlighted snippet can’t otherwise be generated. The default is false.

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion works by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
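The effect can be imitated with a toy model. The sketch below is not Solr code: toy_stem is a deliberately crude stand-in for a real stemmer, and prefix_matches only mimics the idea that a wildcard prefix is compared against the indexed terms without being stemmed itself.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: lowercase, drop a trailing 's', then a trailing 'e'."""
    stem = word.lower()
    if stem.endswith("s"):
        stem = stem[:-1]
    if stem.endswith("e"):
        stem = stem[:-1]
    return stem

# The index only ever sees stemmed terms.
indexed_terms = {toy_stem(w) for w in ["Gatorade", "gatorades", "water"]}

def prefix_matches(prefix, terms):
    """Compare an unstemmed wildcard prefix against the indexed terms."""
    return sorted(term for term in terms if term.startswith(prefix))

print(prefix_matches("gatorade", indexed_terms))  # prefix is longer than the stem
print(prefix_matches("gatorad", indexed_terms))   # prefix equals the stem
```

In this model the prefix gatorade finds nothing because the only related term in the index is the shorter stem gatorad.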
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
So what I was looking for is to turn the search term 'gatorade' into 'gatorade OR gatorade*', which will give me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory.
This indexes n-grams, that is, parts of words. With a minimum n-gram size of 5 and a maximum of 8, the word Documents would be indexed as: Docum Docume Documen Document
There is a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries: you aren't doing a wildcard search, you are matching the search term against n-grams (parts of words).
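To illustrate which prefixes an edge n-gram filter would emit for a token, here is a small Python sketch. It only models the idea of leading substrings within a size range; edge_ngrams is a hypothetical helper, not Solr's implementation.

```python
def edge_ngrams(token, min_size, max_size):
    """Return the leading substrings of `token` with lengths min_size..max_size."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]

grams = edge_ngrams("Documents", 5, 8)
print(grams)  # ['Docum', 'Docume', 'Documen', 'Document']
```

A plain, wildcard-free query term such as docum can then match one of these indexed prefixes directly, provided the same lowercasing is applied at index and query time.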
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

Resources