I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory. .
This will create indexes for ngrams or parts of words. Documents, with a min ngram size of 5 and max ngram size of 8, would index: Docum Docume Document Documents
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: Indexing takes 10 times longer Uses 5 times more disk space Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution
Related
I implementing Solr search using an API. When I call it using the parameters as, "Chillout Lounge", it returns me the collection which are same/similar to the string "Chillout Lounge".
But when I search for "Chillout Lounge Box", it returns me results which don't have any of these three words.(in the DB there are values which have these 3 values, but they are not returned.)
According to me, Solr uses Fuzzy search, but when it is done it should return me some values, which will have at least one these value.
Or what could be the possible changes I should to my schema.XML, such that is would give me proper values.
First of all - "Fuzzy search" is a feature you'll have to ask for (by using ~ in standard Lucene query syntax).
If you're talking about regular searches, you can use q.op to select which operator to use. q.op=AND will make sure that all the terms match, while q.op=OR will make any document that contain at least one of the terms be returned. As long as you aren't using fq for this, the documents that match more terms should be scored higher (as the score will add up across multiple terms), and thus, be shown higher in the result set.
You can use the debug query feature in the web interface to see scores for each term for a document, and find out why the document was returned at all. If the document doesn't match any terms, it shouldn't be returned, unless you're asking for all documents to be returned.
Be aware that the analyzer chain defined for the field you're searching might affect what's considered a match and not.
You'll have to add a proper example to get a more detailed answer.
How do you set up partial (substring) fuzzy match in Solr 4.2.1?
For example, if you have a list of US cities indexed, I would like a search term "Alber" to match "Alburquerque".
I have tried using the NGramFilterFactory on the <fieldType> and rebuilt the index but queries do not return results as expected - they still work as if I had just done the standard text_general defaults. Exact matches work, and explicit fuzzy searches would work given sufficient similarity (for example "Alberquerque~" with one misspelling would work.)
I did go to the analyzer tool in the Solr admin and saw that my ngrams were indeed being generated.
Is there something i'm missing from the query side?
Or should I take a different approach altogether?
And can this work with dismax? (Multiple fields indexed like this with different weights)
Thanks!
If I search for toto.pdf, a token "pdf" is created for the search tI'm indexing some data, including filenames.
What I want is, according to indexed filename:
MySupercool123girlfriend.jpg
And to be able tosearch it with:
supercool
supercool123
123
girlfriend
jpg
So at index it pretty easy to be able to use WordDelimiterFilterFactory so that some tokens are created, like:
my
supercool
mysupercool
mysupercool123
supercool123
123
girlfriend
jpg
girlfriend.jgp
etc...
The matter is that at search time, I don't really know what I should do.
If I use WordDelimiterFilterFactory at search time, MySupercool123girlfriend.jpg would match even with toto.jpg because in both cases a token jpg is created.
toto.jpg should not be in the result list at all, so it's not a solution for me to have both results with the appropriate one having a better scoring
Have you any recommendation to index and search for filenames?
For this specific example of yours i.e. if the search is for MySupercool123girlfriend.jpg and you want this to only return documents that have the entire string in it, you can keep a copyField, say named filename_str, whose fieldType is string. String matches will ensure you that you get an exact match. This could be a first-level "exact match" search you do.
However, I am guessing that you would want a search for 123girlfriend.jpg to return the document containing MySupercool123girlfriend.jpg. You can do a 2nd level search for this. Beginning Solr 4.0 you can do a regex search like
q=filename_str:/.*123girlfriend.jpg/
(This regex query should also work for filename field itself, if you are using preserveOriginal=1 in WordDelimiterFilterFactory at index time.)
Else you can do a leading wild-card search, which works in earlier Solr versions too.
If you also want MySupercool.jpg to match MySupercool123girlfriend.jpg, then I guess you would have to manually do the work of DelimiterFilterFactory and construct a regex query like
q=filename_str:/.*My.*Supercool.*.jpg/
Another issue is that jpg is going to match lot of documents, so you may want to split the filename and the extension and keep them as separate fields.
Can you come up with some meaningful for your use case DisMax mm parameter?
See http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
E.g.
mm=100% and "MySupercool123girlfriend.jpg" would match only filenames that have all ["my", "supercool", "123", "girlfriend", "jpg"] terms in them
You can find some less strict but still giving relevant results expression. See http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/util/doc-files/min-should-match.html
Hello stackOverflowers
I have a field in a Solr document collection with a field called
names_txt - this is a multiValue="true" field.
This field contains all the names of the associated persons to a document
I want to be able to both do a fuzzy search and at the same time limit the number of terms between the to matching terms.
The query
names_txt:("markus foss"~2)
Will return all documents where you find the terms markus and foss where theres max 2 terms between them.
But when i search in a fuzzy way AND want to also specify the max number of terms between the matches, I cant get the syntax right.
The query:
names_txt:(markus~0.7 foss~0.7)
This does work, but returns false postives, since it will return a document with "markus something" in one value, and "foss somethingElse" in another.
What I would like to write is:
(markus~0.7 foss~0.7)~2
but this syntax is illegal in solr.
Anyone out there have a solution for my problem?
Since in one single query term Solr can either process a word distance restraint or a fuzzy search restraint, we will need two terms for this:
names_txt:("markus foss"~2) AND names_txt:(markus~0.7 foss~0.7)
Note that quantifying fuzzyness by a float number is deprecated. Internally, lucene converts converts the float number to an int between 0 and 2 anyway, so we should use this integer (Damereau Levenshtein) edit distance right from the beginning in our search terms. So my final proposal states:
names_txt:("markus foss"~2) AND names_txt:(markus~1 foss~1)
(For those who are interested: The deprecated, somewhat quirky function that converts the similarity float to an edit distance int can be found at the end of this code file.)
I think you could do that using SpanQuery The issue is that the usual query parsers in Solr dont support them. Look at this article that mentions those that support spans: Surround, Xml-Query-Parser and Qsol. But check the status of each in current solr version.
Solr newbie here.
I have created a Solr index and write a whole bunch of docs into it. I can see
from the Solr admin page that the docs exist and the schema is fine as well.
But when I perform a search using a test keyword I do not get any results back.
On entering * : *
into the query (in Solr admin page) I get all the results.
However, when I enter any other query (e.g. a term or phrase) I get no results.
I have verified that the field being queried is Indexed and contains the values I am searching for.
So I am confused what I am doing wrong.
Probably you don't have a <defaultSearchField> correctly set up. See this question.
Another possibility: your field is of type string instead of text. String fields, in contrast to text fields, are not analyzed, but stored and indexed verbatim.
I had the same issue with a new setup of Solr 8. The accepted answer is not valid anymore, because the <defaultSearchField> configuration will be deprecated.
As I found no answer to why Solr does not return results from any fields despite being indexed, I consulted the query documentation. What I found is the DisMax query parser:
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In contrast, the default Lucene parser only speaks about searching one field. So I gave DisMax a try and it worked very well!
Query example:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
You can also specify which fields to search exactly to prevent unwanted side effects. Multiple fields are separated by spaces which translate to + in URLs:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features+text
Last but not least, give the fields a weight:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3
If you are using pysolr like I do, you can add those parameters to your search request like this:
results = solr.search('search term', **{
'defType': 'dismax',
'qf': 'features text'
})
In my case the problem was the format of the query. It seems that my setup, by default, was looking and an exact match to the entire value of the field. So, in order to get results if I was searching for the sit I had to query *sit*, i.e. use wildcards to get the expected result.
With solr 4, I had to solve this as per Mauricio's answer by defining type="text_en" to the field.
With solr 6, use text_general.