I would like to know, what does the '~' character mean in the following Solr query snippet:
... q="field:'value'~30^10 ...
~ is used for fuzzy search in this case.
The fuzzy query is based on the Levenshtein distance algorithm, which computes the minimum number of edits required to convert one token into another.
This is the syntax:
q=field:term~N
where N is the edit distance. The value of N varies from 0 to 2.
If you do not specify anything for N, then a value of 2 is used as default.
N=2 -> This matches the highest number of edits.
N=0 -> This means no edit and would have same effect as term query.
You can give a fractional value between 0 and 1, but any fractional value greater than 1 throws the following error:
org.apache.solr.search.SyntaxError: Fractional edit distances are not allowed!
Note: a fractional value less than 1 also defaults to 2,
so q=field:term~0.2 has the same effect as q=field:term~2.
Any distance greater than 2 also defaults to 2.
So the following query
q="field:value~30"
is the same as (you can verify this by looking at the debug query output)
q="field:value~2"
which will match the highest number of edits.
Note:
The tilde in a fuzzy query is different from the tilde in a proximity query. In a proximity query the tilde is applied after the closing quotation mark,
e.g. the query below:
q=field:"foo bar"~30
So in your case, when you add quotes around the value
q="field:'value'~30"
it becomes a proximity search, which only really applies if you have two or more terms in the phrase. So it won't do much beyond finding docs which have "value" set in "field".
In your example it means nothing - but if there were multiple words in your query, i.e. "foo bar"~30, it would mean "find foo and bar within 30 positions of each other". It allows you to give a phrase match a margin in regard to how close each term has to be to each other.
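To make the "within 30 positions of each other" idea concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: the function name `within_distance` is hypothetical, the margin is counted as the number of tokens between the two terms, and word order is ignored. Lucene's real slop calculation is more involved, so treat this only as an illustration of the margin idea.

```python
def within_distance(tokens, first, second, max_gap):
    """Return True if `first` and `second` occur with at most `max_gap`
    other tokens between them, in either order (simplified model)."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        abs(i - j) - 1 <= max_gap
        for i in first_positions
        for j in second_positions
        if i != j
    )


# Seven tokens sit between "foo" and "bar" in this toy document.
tokens = "foo is followed a few words later by bar".split()
print(within_distance(tokens, "foo", "bar", 30))  # margin is large enough
print(within_distance(tokens, "foo", "bar", 2))   # margin is too small
```

With a single-term phrase there is no second term to measure against, which is why the margin has no effect in the original query.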
The ^10 part is telling Lucene how much to weight the phrase match compared to other parts of the query.
From the Lucene Query Parser Syntax description:
Lucene supports finding words are a within a specific distance away. To do a proximity search use the tilde, "~", symbol at the end of a Phrase. For example to search for a "apache" and "jakarta" within 10 words of each other in a document use the search:
"jakarta apache"~10
I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of 4 results, is a document containing only 2 of those words multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words, should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct: it ranks results with an algorithm called TF-IDF.
So, in your case, the order can be explained by the TF-IDF formula.
One possible solution is to ignore the TF-IDF score and count each matching term as one hit, so that a document with 5 matches gets a score of 5, a document with 4 matches gets 4, etc. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
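As a rough illustration of why this fixes the ordering, the Python sketch below builds the same query string and mimics the resulting score with a toy scorer. Both function names are hypothetical, and the scorer is an assumption about how the clause scores add up (one fixed score per matching clause, no matter how often the term occurs); it is not Solr's actual scoring code.

```python
def constant_score_query(field, terms, score=1):
    """Build a query in which every clause carries a fixed score via ^=."""
    return " ".join(f"{field}:{term}^={score}" for term in terms)


def matched_term_score(doc_tokens, query_terms, score=1):
    """Toy scorer: each query term found in the document adds `score` once,
    regardless of how many times the term occurs."""
    present = set(doc_tokens)
    return sum(score for term in query_terms if term in present)


terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
print(constant_score_query("text", terms))

doc_all_terms = "Julian Cribb told the EPA about peak oil".split()
doc_two_terms = "peak oil peak oil peak oil".split()
print(matched_term_score(doc_all_terms, terms))  # all five terms match
print(matched_term_score(doc_two_terms, terms))  # only two distinct terms match
```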
Another solution, which requires some scripting, goes like this. First, run a query that requires all 5 terms, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for every combination of 4 terms out of 5 (if I'm not mistaken, that is 5 additional queries), and so on, down to queries with a single mandatory clause. You then have the full result set, and you only need to normalise the scores, or simply concatenate the result lists if you decide that documents matching 5 terms always rank above documents matching 4. Cons of this solution: a lot of queries, which need to be run programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of matched terms.
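The query generation for this second approach could be scripted along these lines. This is only a sketch: `mandatory_clause_queries` is a hypothetical helper, and actually sending the queries to Solr and merging the results is left out.

```python
from itertools import combinations


def mandatory_clause_queries(terms):
    """Yield (k, query) pairs, starting with all terms mandatory and
    ending with the single-term queries."""
    for k in range(len(terms), 0, -1):
        for subset in combinations(terms, k):
            yield k, " ".join(f"+{term}" for term in subset)


terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
queries = list(mandatory_clause_queries(terms))
print(queries[0])    # the query with all five terms mandatory
print(len(queries))  # total number of queries to run
```

For 5 terms this produces 1 + 5 + 10 + 10 + 5 = 31 queries, which shows how quickly the number of requests grows with the number of terms.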
I am working on a fuzzy query using Solr, which goes over a repository of data that could contain misspelled or abbreviated words. For example, the repository could have a name containing the word "Hlth" (an abbreviated form of the word 'Health').
If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.
I would like to get the first query to work. In my business use case, I would have to use the clean data to query for all the misspelled or abbreviated words.
Could someone please help and shed some light on why fuzzy search #1 is not working, and whether there are other ways of achieving the same?
You are using the fuzzy query in the wrong way.
According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum
allowed number of edits (for older releases N was a confusing float
between 0.0 and 1.0, which translates to an equivalent max edit
distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for
mcandless~1 and it will match mccandless (insert c), mcandles (remove
s), mkandless (replace c with k) and a great many other "close" terms.
With max edit distance 2 you can have up to 2 insertions, deletions or
substitutions. The score for each match is based on the edit distance
of that term; so an exact match is scored highest; edit distance 1,
lower; etc.
So you need to write queries like this - Health~2
You write: "I wanted to match Parkway with Pkwy"
Parkway and Pkwy have an edit distance of 3. You could achieve this by subbing in "~3" for "~2" from the first response, but Solr fuzzy matching is not recommended for values greater than 2 for performance reasons.
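The distances quoted in this thread can be checked with a small edit-distance function. This is the textbook dynamic-programming Levenshtein distance written in Python for illustration; it is not Lucene's implementation, and the function name is my own.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Parkway", "Pkwy"))  # 3: beyond the ~2 limit
print(levenshtein("Health", "Hlth"))   # 2: reachable with Health~2
```

So Health~2 can reach Hlth, while Parkway and Pkwy are one edit too far apart for a fuzzy query capped at 2.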
I think the best way to approach your problem would be to generate a context-specific dictionary of synonyms and do query-time expansion.
Using phonetic filters may solve your problem.
Please consider looking at the following
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
Hope this helps.
I have a case wherein I would like to match like this:
Query: abcd efgh ijkl mnop
After this, the query is passed through an NGram tokenizer and each word is split into 2-gram tokens.
E.g. the query is split into:
ab,bc,cd,ef,fg,gh,ij,jk,kl,mn,no,op
Now, while matching, I want the minimum match to be customized for tokens within words.
I mean: by default, with mm=1, an indexed document is returned when any one token corresponding to a word matches it. And if I give mm=2, then any one token from each of any 2 words needs to match for the indexed document to be returned.
But what I want is: return a document only when at least 'm' tokens match from each of mm words.
For example, I would want at least 2 tokens each from at least 3 words for the indexed document to be selected.
It seems Lucene's IndexSearcher does this core part. Do I need to change the code, or is there any config which would do the above?
Thanks in advance...
This isn't exactly what you're asking for, but I'm guessing your underlying question is "how can I ensure that fuzzy searches only return things which are 'close' to the original query?"
The syntax foo~.8 does this - see the docs. Basically, .8 is a measure based on the edit (Levenshtein) distance divided by the length of the word.
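To get a feel for what such a fractional value allows, the Python sketch below converts a similarity into a maximum number of edits. It rests on two assumptions that are mine, not the answer's: that the similarity means 1 - (edits / term length), and that the result is capped at 2 edits. The function name `similarity_to_edits` is hypothetical; check the documentation of your Lucene/Solr version before relying on these numbers.

```python
def similarity_to_edits(similarity, term_length, max_edits=2):
    """Translate a fractional similarity (0 < similarity < 1) into a maximum
    edit count, assuming similarity = 1 - edits / term_length and a cap of
    `max_edits`."""
    if not 0 < similarity < 1:
        raise ValueError("similarity must be strictly between 0 and 1")
    return min(int((1 - similarity) * term_length), max_edits)


print(similarity_to_edits(0.8, 3))   # edits allowed for foo~.8 under this model
print(similarity_to_edits(0.75, 4))  # edits allowed for a 4-letter term at .75
```

Under this model a high similarity on a short word leaves little or no room for edits, so the same fraction behaves very differently for short and long terms.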
If you want to stick to your idea of counting pairs which must match, you can do some math to figure out what the minimum Levenshtein distance needs to be.
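For reference, the rule from the question ("at least m tokens each from at least k words") can be written down as a client-side check. This Python sketch only illustrates the counting logic; it is not something Solr's mm parameter does out of the box, and the helper names `bigrams` and `matches_rule` are hypothetical.

```python
def bigrams(word):
    """Split a word into overlapping 2-character tokens."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def matches_rule(query, indexed_tokens, min_tokens_per_word, min_words):
    """Return True if at least `min_words` query words each have at least
    `min_tokens_per_word` of their bigrams present in `indexed_tokens`."""
    indexed = set(indexed_tokens)
    qualifying_words = 0
    for word in query.split():
        hits = sum(1 for gram in bigrams(word) if gram in indexed)
        if hits >= min_tokens_per_word:
            qualifying_words += 1
    return qualifying_words >= min_words


query = "abcd efgh ijkl mnop"
# Bigrams of a hypothetical indexed document "abcd efgx ijkl".
doc_tokens = bigrams("abcd") + bigrams("efgx") + bigrams("ijkl")
print(matches_rule(query, doc_tokens, min_tokens_per_word=2, min_words=3))
```

A check like this could be applied as a post-filter on the documents Solr returns, at the cost of fetching more candidates than you finally keep.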
I am confused here, but I want to clear up my doubt. I think it is a stupid question, but I want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
What does this ^ mean here?
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on boosting, please see the Boosting Ranking Terms section in the Solr Relevancy Cookbook. This slide deck about boosting from the Lucene Revolution Conference earlier this year also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.
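The ordering described above can be mimicked with a toy scorer in Python. This is a deliberately simplified model: the real score combines the boost with the other relevancy factors, whereas here each matching clause just contributes its boost. The function name and the sample documents are made up for illustration, and each document is assumed to be indexed with both its original and its lowercased token, as the cookbook recipe suggests.

```python
def boosted_score(doc_tokens, boosted_terms):
    """Toy scorer: add up the boost of every query term found in the document."""
    present = set(doc_tokens)
    return sum(boost for term, boost in boosted_terms.items() if term in present)


# (term:NeXT^10 term:Next^5 term:next), with an implicit boost of 1 on "next".
query = {"NeXT": 10, "Next": 5, "next": 1}

docs = {
    "doc_NeXT": ["NeXT", "next"],
    "doc_Next": ["Next", "next"],
    "doc_next": ["next"],
}
ranked = sorted(docs, key=lambda name: boosted_score(docs[name], query), reverse=True)
print(ranked)
```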
Can someone confirm that, within a Free Text search, the query Word1 NEAR Word2 is identical to Word2 NEAR Word1?
So that Word order is not relevant.
I am trying to highlight the results and if this is the case I need to look for occurrences of the reversal of the original search term words.
I've done a quick test on a database I have with a free-text index and the results of the query don't appear to vary depending on the order of words in the NEAR query. In other words the following two queries returned the exact same results in the same order:
SELECT * FROM DOCUMENT WHERE CONTAINS (Contents, 'health NEAR medical')
SELECT * FROM DOCUMENT WHERE CONTAINS (Contents, 'medical NEAR health')
So I would conclude there is no difference. This is backed up by the documentation that states:
"NEAR indicates the logical distance
between terms, rather than the
absolute distance between them. For
example, terms within different
phrases or sentences within a
paragraph are treated as farther apart
than terms in the same phrase or
sentence, regardless of their actual
proximity, on the assumption that they
are less related. Likewise, terms in
different paragraphs are treated as
being even farther apart."
Given that distance between two words will always be the same, regardless of order, then I can't see it would make any difference and my tests back this up.
It seems obvious when written out - thanks for running the test.
As far as I can tell, the only measure of the "nearness" of the 2 words is that the FTS rank value of 1 equates to a 50-word difference.
One assumes that a rank of 100 indicates there is no word gap, i.e. the words are consecutive.
SELECT fld_Description
FROM tbl_ProductDescription
WHERE CONTAINS(fld_Description , 'bike NEAR performance');