I have a case wherein I would like to match like this:
Query: abcd efgh ijkl mnop
The query is then passed through an NGram tokenizer, which splits each word into 2-gram tokens.
For example, the query above is split into:
ab, bc, cd, ef, fg, gh, ij, jk, kl, mn, no, op
Now, while matching, I want to customize the minimum match (mm) so that it applies to tokens within words.
By default, with mm=1, an indexed document is returned as soon as any one token from any word matches it. With mm=2, one token from each of any 2 words needs to match for the document to be returned.
What I want instead is to return a document only when at least 'm' tokens match in each of at least mm words.
For example, I would want at least 2 tokens each from at least 3 words to match for the indexed document to be selected.
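To make the rule concrete, here is a minimal Python sketch of the check described above. It only illustrates the desired matching logic and is not Solr or Lucene code; the function names (bigrams, document_matches) and the sample document are made up for this example.

```python
def bigrams(word):
    """Return the set of 2-gram tokens of a single word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def document_matches(query, doc_tokens, min_tokens_per_word, min_words):
    """True if at least `min_words` query words each have at least
    `min_tokens_per_word` of their 2-grams present in the document."""
    satisfied_words = 0
    for word in query.split():
        hits = len(bigrams(word) & doc_tokens)
        if hits >= min_tokens_per_word:
            satisfied_words += 1
    return satisfied_words >= min_words


if __name__ == "__main__":
    query = "abcd efgh ijkl mnop"
    # Hypothetical indexed document, tokenized the same way as the query.
    doc_tokens = set()
    for doc_word in "abcd efgx ijkl zzzz".split():
        doc_tokens |= bigrams(doc_word)
    # At least 2 tokens each from at least 3 words.
    print(document_matches(query, doc_tokens, 2, 3))
```

Here min_tokens_per_word plays the role of 'm' and min_words the role of mm.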
It seems that Lucene's IndexSearcher does this core part. Do I need to change the code, or is there a config setting that would do this?
Thanks in advance...
This isn't exactly what you're asking for, but I'm guessing your underlying question is "how can I ensure that fuzzy searches only return things which are 'close' to the original query?"
The syntax foo~.8 does this - see the docs. Basically, .8 is a similarity threshold derived from the edit (Levenshtein) distance divided by the length of the word.
If you want to stick to your idea of counting pairs which must match, you can do some math to figure out what the corresponding Levenshtein distance needs to be.
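One way to do that math, as a rough sketch: a single edit (insertion, deletion or substitution) can destroy at most 2 of a word's 2-gram tokens, so a word of length n that is within d edits of an indexed term still shares at least (n - 1) - 2d of its 2-grams with it (the q-gram lemma for q = 2). The helper names below are hypothetical and the code is plain Python arithmetic, not anything from Lucene.

```python
def min_shared_bigrams(word_len, edits):
    """Lower bound on the 2-grams a word of `word_len` characters still
    shares with a term that is `edits` edits away from it."""
    return max(0, (word_len - 1) - 2 * edits)


def max_edits_for(word_len, required_shared):
    """Largest edit distance that still guarantees `required_shared`
    matching 2-grams, or None if the word has too few 2-grams."""
    total = word_len - 1  # number of 2-grams in the word
    if required_shared > total:
        return None
    return (total - required_shared) // 2


if __name__ == "__main__":
    for length in (4, 6, 8):
        print(length, max_edits_for(length, 2))
```

Note that the bound is only a guarantee in one direction: a term can share that many 2-grams and still be further away in edit distance.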
Is this correct behavior or do I need to do some extra setting?
I created a custom filter that removes special characters and adds synonyms.
Search in Solr is a two-step process: first the documents are matched, and then they are scored to order the results. The scoring takes into account how near one term is to another, so if the order of the words in the query changes, the scoring is affected.
If you omit the storing of word positions with omitPositions="true" in the field definition, the search should not be affected by the word order. The Solr fields documentation describes many more options and how they affect the search.
I would like to know, what does the '~' character mean in the following Solr query snippet:
... q="field:'value'~30^10 ...
~ is used for fuzzy search in this case.
The fuzzy query is based on the Levenshtein distance algorithm, which identifies the minimum number of edits required to convert one token into another.
This is the syntax that is used:
q=field:term~N
where N is the edit distance. The value of N varies from 0 to 2.
If you do not specify anything for N, then a value of 2 is used as default.
N=2 -> This matches the highest number of edits.
N=0 -> This means no edit and would have same effect as term query.
You can give a fractional value between 0 and 1, but any fractional value greater than 1 will throw the following error:
org.apache.solr.search.SyntaxError: Fractional edit distances are not allowed!
Note: However, giving a fractional value less than 1 also defaults to 2.
So q=field:term~0.2 will have the same effect as q=field:term~2.
Also, any distance greater than 2 will default to 2.
So the following query
q="field:value~30"
is the same as (you can verify this by looking at the debug query):
q="field:value~2"
which will match the highest number of edits.
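To check this yourself, you can send both queries with debugQuery=true and compare the parsed queries in the response. The sketch below only builds the two request URLs with the Python standard library; the host, port and core name (localhost:8983, mycore) are assumptions, so replace them with your own.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and core name.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def debug_url(query, base=SOLR_SELECT):
    """Build a select URL that asks Solr to include debug output."""
    params = {"q": query, "debugQuery": "true"}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # Compare the parsed query in the debug section of both responses.
    for q in ("field:value~30", "field:value~2"):
        print(debug_url(q))
```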
Note:
The tilde in a fuzzy query is different from the tilde in a proximity query. In a proximity query the tilde is applied after the closing quotation mark.
e.g. the query below:
q=field:"foo bar"~30
So in your case, when you add quotes around the value
q="field:'value'~30"
it becomes a proximity search, which only really applies if you have two or more terms in the phrase. So it won't do much other than find docs which have "value" set in "field".
In your example it means nothing - but if there were multiple words in your query, i.e. "foo bar"~30, it would mean "find foo and bar within 30 positions of each other". It allows you to give a phrase match a margin in regard to how close each term has to be to each other.
The ^10 part is telling Lucene how much to weight the phrase match compared to other parts of the query.
From the Lucene Query Parser Syntax description:
Lucene supports finding words are a within a specific distance away. To do a proximity search use the tilde, "~", symbol at the end of a Phrase. For example to search for a "apache" and "jakarta" within 10 words of each other in a document use the search:
"jakarta apache"~10
I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of 4 results, is a document containing only 2 of those words multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words, should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct: it ranked the results using an algorithm called TF-IDF.
So, in your case, the order can be explained by this formula.
One possible solution is to ignore the TF-IDF score and count each matching term in a document as one, so that a document with 5 matching terms gets a score of 5, one with 4 matches gets 4, and so on. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
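As a small illustration, the query string above can be generated from a list of terms, and the score you would roughly expect for a document is then just the number of query terms it contains. This is a Python sketch with made-up helper names (constant_score_query, expected_score); the real score comes from Solr and may include other factors depending on the version and similarity in use.

```python
def constant_score_query(field, terms, score=1):
    """Build one constant-score clause (^=) per term."""
    return " ".join(f"{field}:{term}^={score}" for term in terms)


def expected_score(doc_terms, query_terms, score=1):
    """Rough expectation: each matching clause contributes `score`."""
    return score * sum(1 for term in query_terms if term in doc_terms)


if __name__ == "__main__":
    terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
    print(constant_score_query("text", terms))
    # A hypothetical document that mentions only two of the terms.
    print(expected_score({"peak", "oil"}, terms))
```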
Another solution, which requires some scripting, goes like this. First, run a query that requires all 5 terms, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for each combination of 4 terms out of 5 (if I'm not mistaken, that requires 5 additional queries), and so on, until you reach the queries with a single mandatory clause. At that point you have the full results, and you only need to normalise the scores, or simply concatenate the result lists if you decide that 5-match docs are always better than 4-match docs. Cons of this solution: a lot of queries, you need to run them programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the notion of matched terms.
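A script for generating those queries could look roughly like this. It is a Python sketch using itertools.combinations; the function name mandatory_queries is made up, and actually sending the queries to Solr and merging the results is left out.

```python
from itertools import combinations


def mandatory_queries(terms):
    """Yield (number_of_terms, query) pairs, from all terms mandatory
    down to a single mandatory term."""
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            yield size, " ".join("+" + term for term in combo)


if __name__ == "__main__":
    terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
    queries = list(mandatory_queries(terms))
    print(len(queries))
    for size, query in queries[:6]:
        print(size, query)
```

For 5 terms this produces 1 + 5 + 10 + 10 + 5 = 31 queries. A document that matches all 5 terms will also be returned by the smaller combinations, so when concatenating you would keep only the first (largest) group in which each document appears.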
In Apache Solr if I have two fields from two different documents:
field 1: "tom sawyer was a character in huckleberry finn"
field 2: "a character in huckleberry finn is tom sawyer"
*note that for simplicity the fields don't appear tokenized as shown here, but they are in the index
And I search for "a character in huckleberry finn," (also tokenized) will field 2 score higher because not only are the tokens in the same order in the field as they are in the query, but the position of the phrase in the text is at the beginning in both the field and in the query?
No. The positions are not used for computing the score, except for the positions in relation to each other if you use a phrase query. In your example, they're the same - so the score should be identical.
Rather than having a separate post for each similar question you might have, it's probably better to refer to the Lucene Practical Scoring Formula, which shows how the score is actually calculated for the TFIDF similarity. Remember that the similarity calculation is pluggable, so if you're using a different similarity, the calculation will be different.
These things are also simple to test yourself: just index two documents with the text and issue a query with debugQuery set to true, and you'll see how each element contributes to the score.
I am working on a fuzzy query using Solr, which goes over a repository of data that could contain misspelled or abbreviated words. For example, the repository could have a name with the word "Hlth" (an abbreviated form of the word 'Health').
If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.
I would like to get the first query to work. In my business use case, I would have to use the clean data to query for all the misspelled or abbreviated words.
Could someone please help and shed some light on why fuzzy search #1 is not working, and whether there are other ways of achieving the same?
You are using the fuzzy query the wrong way.
According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum
allowed number of edits (for older releases N was a confusing float
between 0.0 and 1.0, which translates to an equivalent max edit
distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for
mcandless~1 and it will match mccandless (insert c), mcandles (remove
s), mkandless (replace c with k) and a great many other "close" terms.
With max edit distance 2 you can have up to 2 insertions, deletions or
substitutions. The score for each match is based on the edit distance
of that term; so an exact match is scored highest; edit distance 1,
lower; etc.
So you need to write queries like this: Health~2
You write: "I wanted to match Parkway with Pkwy"
Parkway and Pkwy have an edit distance of 3. You could achieve this by subbing in "~3" for "~2" from the first response, but Solr fuzzy matching is not recommended for values greater than 2 for performance reasons.
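If you want to check the edit distances for your own pairs, here is a minimal Python sketch of the classic dynamic-programming Levenshtein distance. It is not the automaton-based implementation Lucene uses for fuzzy queries, just a way to compute the same number by hand; the words are lowercased on the assumption that the field is lowercased at index time.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for pair in (("parkway", "pkwy"), ("health", "hlth")):
        print(pair, levenshtein(*pair))
```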
I think the best way to approach your problem would be to generate a context-specific dictionary of synonyms and do query-time expansion.
Using phonetic filters may solve your problem.
Please consider looking at the following
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
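To give an idea of what a phonetic encoder does, here is a minimal Python sketch of American Soundex, one of the classic phonetic algorithms. It is only an illustration written for this post, not Solr's implementation, and the encoder you configure in Solr may behave differently.

```python
# Consonant groups that share a Soundex digit.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    word = word.lower()
    result = word[0].upper()
    previous = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code
    return (result + "000")[:4]


if __name__ == "__main__":
    for word in ("Health", "Hlth", "Parkway", "Pkwy"):
        print(word, soundex(word))
```

With this sketch 'Health' and 'Hlth' get the same code, while 'Parkway' and 'Pkwy' do not, so abbreviations that drop consonants may still need synonyms.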
Hope this helps.