I am confused here and want to clear up a doubt. It may be a silly question, but I would like to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
What does the ^ mean here?
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This gives a boost (makes it more important) to the value NeXT versus next in this query. From the wiki page you linked to: "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on boosting, please see the Boosting Ranking Terms section in the Solr Relevancy Cookbook. This Slide Deck about Boosting from the Lucene Revolution Conference earlier this year also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.
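The client-side expansion described in the wiki excerpt could be sketched roughly as follows. This is only an illustration: the function name and the default boost of 10 are assumptions taken from the example query, not part of any Solr API.

```python
def expand_case_sensitive_term(field: str, term: str, boost: int = 10) -> str:
    """Expand a term with upper-case characters into a boosted original
    plus a lowercased variant (hypothetical helper, not a Solr API)."""
    lowered = term.lower()
    if lowered == term:
        # Already lowercase: there is nothing to expand.
        return f"{field}:{term}"
    return f"({field}:{term}^{boost} OR {field}:{lowered})"


print(expand_case_sensitive_term("text", "NeXT"))  # (text:NeXT^10 OR text:next)
```

A document containing NeXT matches both clauses, and the ^10 on the original term pushes it above documents that only match the lowercased form.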
Related
I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of 4 results, is a document containing only 2 of those words multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words, should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct: it applied a ranking algorithm called TF-IDF.
So, in your case, order could be explained by this formula.
One possible solution is to ignore the TF-IDF score and count each matching term in a document as one, so that a document with 5 matches gets a score of 5, one with 4 matches gets 4, and so on. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
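A minimal sketch of building such a query from a list of terms, plus a toy model of the ranking it aims for. Both helper names are hypothetical, and the counting function uses naive whitespace tokenization; it only illustrates the idea of one point per matched term and is not how Solr computes scores internally.

```python
def constant_score_query(field, terms, score=1):
    """Join one constant-score clause (^=) per search term."""
    return " ".join(f"{field}:{term}^={score}" for term in terms)


def count_matched_terms(document_text, terms):
    """Toy model: one point per distinct query term present in the text."""
    tokens = set(document_text.lower().split())
    return sum(1 for term in terms if term.lower() in tokens)


terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
print(constant_score_query("text", terms))
print(count_matched_terms("peak oil peak oil oil", terms))            # 2
print(count_matched_terms("Julian Cribb EPA on peak oil", terms))     # 5
```

Under this model, repeating two of the words many times no longer helps a document outrank one that contains all five.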
Another solution, which requires some scripting, goes like this. First, run a query in which all 5 terms are mandatory, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for each combination of 4 terms out of 5 (if I'm not mistaken, that requires 5 additional queries), and so on, down to a single mandatory clause. You then have the full result set and only need to normalise the scores, or simply concatenate the results if you decide that 5-match documents are always better than 4-match documents. Cons of this solution: a lot of queries that have to be run programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of counting matched terms.
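The tiered approach could be scripted along these lines. This sketch only generates the query strings and concatenates per-tier results; actually sending the queries to Solr is left out, and both function names are made up for illustration.

```python
from itertools import combinations


def mandatory_clause_queries(terms):
    """Yield (k, query) pairs, from all terms mandatory down to one term."""
    for k in range(len(terms), 0, -1):
        for combo in combinations(terms, k):
            yield k, " ".join(f"+{term}" for term in combo)


def concatenate_tiers(tiers):
    """Concatenate result lists (best tier first), skipping documents
    already seen: a 4-term query also matches every 5-term document."""
    seen, merged = set(), []
    for doc_ids in tiers:
        for doc_id in doc_ids:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


queries = list(mandatory_clause_queries(["Julian", "Cribb", "EPA", "peak", "oil"]))
print(queries[0])    # (5, '+Julian +Cribb +EPA +peak +oil')
print(len(queries))  # 31 queries in total for 5 terms
```

For 5 terms this produces 1 + 5 + 10 + 10 + 5 = 31 queries, which makes the "a lot of queries" drawback concrete.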
In Apache Solr if I have two fields from two different documents:
field 1: "tom sawyer was a character in huckleberry finn"
field 2: "a character in huckleberry finn is tom sawyer"
*note that for simplicity the fields don't appear tokenized as shown here, but they are in the index
And I search for "a character in huckleberry finn," (also tokenized) will field 2 score higher because not only are the tokens in the same order in the field as they are in the query, but the position of the phrase in the text is at the beginning in both the field and in the query?
No. The positions are not used for computing the score, except for the positions in relation to each other if you use a phrase query. In your example, they're the same - so the score should be identical.
Rather than posting a separate question for each similar case, it's probably better to refer to the Lucene Practical Scoring Formula, which shows how the score is actually calculated for the TFIDF similarity. Remember that the similarity calculation is pluggable, so if you're using a different similarity, the calculation will be different.
These items are also simple to test by yourself - just index two documents with the text and issue a query with debugQuery set to true - and you'll see how each element contributes to the score.
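As a rough sketch, the request for such a test could be built like this. The base URL and the core name "mycore" are assumptions about a local installation; q, debugQuery and wt are standard Solr request parameters. The code only constructs the URL and does not send it.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, query):
    """Build a /select URL that asks Solr to explain its scoring."""
    params = {"q": query, "debugQuery": "true", "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr instance and core name; adjust to your setup.
url = debug_query_url(
    "http://localhost:8983/solr/mycore",
    'text:"a character in huckleberry finn"',
)
print(url)
```

The response then carries a debug section with an explanation of how each document's score was put together, so the two documents can be compared directly.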
Let's say I have a query like this:
text_data:(Apple OR Apple~2)
How do I know what boost value to provide to give the direct match a clear priority over the fuzzy match?
You can't really guarantee a clear priority, as the fuzzy search will naturally match on more terms (Apple, Appl, App, Appla and so on). Just give the exact term a boost value high enough that it will outscore the fuzzy search in everything but edge cases. The fuzzy search will also help you out by scoring an exact match for 'Apple' higher than any matches that have deletions or substitutions.
text_data:(Apple^10 OR Apple~2)
This multiplies the normal score for the Apple search term by 10.
I'm using character proximity to allow for some misspellings, for example:
text:manager~1
This allows both 'manager' and 'managre' to be matched. The problem is, the misspellings are always ranked higher than the proper spelling because there are fewer of those in the index. For example, let's say I have 3 documents as follows:
1) text:manager
2) text:manager
3) text:managre
Then the character proximity query above will give an inverse document frequency (idf) of 1.7 to 'managre' and 1.2 to 'manager', effectively ranking the misspelled 'managre' higher. From a technical perspective, this makes sense (there are fewer occurrences of 'managre' than 'manager'), but in reality, this doesn't make sense. Is there a way to get Solr to set the idf of misspelled words to match that of the correct spelling?
Short answer: no. Long answer: you have good options here, but you need to solve this in a different way.
To begin with, take advantage of query-time boosting. You can query something like:
text:manager^1.2 OR text:manager~1^0.8
Here you are saying: my user is smart, so I will give the user's query a higher boost, but just in case, I will give its variants a slightly lower boost. You need to combine an exact-match query with a higher boost and a fuzzy query with a lower boost using a Boolean OR, so that exact matches rank higher. Do not worry about the extra work for Solr; it is built for very complex Lucene query trees. Using a combination of queries to get the expected relevancy ranking is common practice.
TF, IDF, and Solr's built-in relevancy ranking are somewhat arbitrary; framing queries with boosts, boolean queries, and context-based filters is where the power and flexibility of Solr lie.
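A small sketch of generating the exact-plus-fuzzy query on the client side. The function name is hypothetical and the default boosts simply mirror the 1.2 and 0.8 used in the example above; suitable values depend on your index.

```python
def exact_plus_fuzzy(field, term, exact_boost=1.2, fuzzy_boost=0.8, max_edits=1):
    """OR an exact clause (higher boost) with a fuzzy clause (lower boost)."""
    exact = f"{field}:{term}^{exact_boost}"
    fuzzy = f"{field}:{term}~{max_edits}^{fuzzy_boost}"
    return f"{exact} OR {fuzzy}"


print(exact_plus_fuzzy("text", "manager"))
# text:manager^1.2 OR text:manager~1^0.8
```

Documents containing the correctly spelled term match both clauses, while misspellings match only the lower-boosted fuzzy clause, which works against the idf advantage of the rare misspelling.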
I am working on a fuzzy query using Solr, which goes over a repository of data that could contain misspelled or abbreviated words. For example, the repository could have a name with the word "Hlth" (an abbreviated form of the word 'Health').
If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.
I would like to get the first query to work. In my business use case, I have to use the clean data to query for all the misspelled or abbreviated words.
Could someone please help and shed some light on why the #1 fuzzy search is not working, and whether there are any other ways of achieving the same result?
You are using the fuzzy query the wrong way.
According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum
allowed number of edits (for older releases N was a confusing float
between 0.0 and 1.0, which translates to an equivalent max edit
distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for
mcandless~1 and it will match mccandless (insert c), mcandles (remove
s), mkandless (replace c with k) and a great many other "close" terms.
With max edit distance 2 you can have up to 2 insertions, deletions or
substitutions. The score for each match is based on the edit distance
of that term; so an exact match is scored highest; edit distance 1,
lower; etc.
So you need to write queries like this: Health~2
You write: "I wanted to match Parkway with Pkwy"
Parkway and Pkwy have an edit distance of 3. You could achieve this by subbing in "~3" for "~2" from the first response, but Solr fuzzy matching is not recommended for values greater than 2 for performance reasons.
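The edit distances mentioned in these answers can be checked with a plain Levenshtein implementation such as the sketch below. It counts insertions, deletions and substitutions only; Lucene's FuzzyQuery may additionally treat a transposition of adjacent characters as a single edit, so treat this as an approximation of what Solr does.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("Health", "Hlth"))   # 2
print(edit_distance("Parkway", "Pkwy"))  # 3
```

So Health~2 can reach Hlth, while Parkway and Pkwy are one edit too far apart for a distance of 2.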
I think the best way to approach your problem would be to generate a context-specific dictionary of synonyms and do query-time expansion.
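Query-time expansion from such a dictionary might look roughly like this. The dictionary contents and the function name are purely illustrative. Solr can also apply synonyms inside the analysis chain with a synonym filter, which may be preferable to doing it in client code.

```python
# Hypothetical, hand-maintained abbreviation dictionary for this data set.
SYNONYMS = {
    "parkway": ["pkwy"],
    "health": ["hlth"],
}


def expand_with_synonyms(field, term, synonyms=SYNONYMS):
    """OR the clean term with its known abbreviations, if there are any."""
    variants = [term] + synonyms.get(term.lower(), [])
    if len(variants) == 1:
        return f"{field}:{term}"
    return f"{field}:(" + " OR ".join(variants) + ")"


print(expand_with_synonyms("Name", "Health"))  # Name:(Health OR hlth)
```

Whether the case of the variants matters depends on how the field is analyzed at index time.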
Using phonetic filters may solve your problem.
Please consider looking at the following
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
Hope this helps.