SQL-Server Free Text Search - NEAR operator - sql-server

Can someone confirm that, within a Free Text search, the query Word1 NEAR Word2 is identical to Word2 NEAR Word1?
So that word order is not relevant.
I am trying to highlight the results, and if this is the case I need to look for occurrences of the search terms in reverse order as well.

I've done a quick test on a database I have with a free-text index and the results of the query don't appear to vary depending on the order of words in the NEAR query. In other words the following two queries returned the exact same results in the same order:
SELECT * FROM DOCUMENT WHERE CONTAINS (Contents, 'health NEAR medical')
SELECT * FROM DOCUMENT WHERE CONTAINS (Contents, 'medical NEAR health')
So I would conclude there is no difference. This is backed up by the documentation that states:
"NEAR indicates the logical distance between terms, rather than the absolute distance between them. For example, terms within different phrases or sentences within a paragraph are treated as farther apart than terms in the same phrase or sentence, regardless of their actual proximity, on the assumption that they are less related. Likewise, terms in different paragraphs are treated as being even farther apart."
Given that the distance between two words is always the same regardless of their order, I can't see how it would make any difference, and my tests back this up.
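For the highlighting part of the question, a minimal Python sketch of an order-independent matcher is below. It assumes a plain \w+ tokenizer and a hypothetical max_gap window of 50 tokens. Neither is taken from SQL Server, whose word breaker and notion of logical distance will differ, so treat find_near_pairs as an illustration only.

```python
import re


def find_near_pairs(text, word1, word2, max_gap=50):
    """Return (start, end) character spans that cover word1 and word2 when
    they occur within max_gap tokens of each other, in either order.

    max_gap is a hypothetical window size, not SQL Server's own limit.
    """
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    w1, w2 = word1.lower(), word2.lower()
    spans = []
    for i, (tok, start, _) in enumerate(tokens):
        if tok not in (w1, w2):
            continue
        other = w2 if tok == w1 else w1
        # Look ahead only, so a pair is found whichever word comes first.
        for tok2, _, end2 in tokens[i + 1:i + 1 + max_gap]:
            if tok2 == other:
                spans.append((start, end2))
                break
    return spans
```

Calling it with the two words swapped returns the same spans, so a single pass covers both orders when highlighting.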

It seems obvious when written out - thanks for running the test.
As far as I can tell, the only measure of the "nearness" of the two words is the FTS rank, where a rank value of 1 equates to a 50-word gap.
One assumes that a rank of 100 indicates there is no word gap, i.e. the words are consecutive.

SELECT fld_Description
FROM tbl_ProductDescription
WHERE CONTAINS(fld_Description , 'bike NEAR performance');

Related

Solr boosting and '~' character

I would like to know, what does the '~' character mean in the following Solr query snippet:
... q="field:'value'~30^10 ...
~ is used to do a fuzzy search in this case.
The fuzzy query is based on the Levenshtein distance algorithm, which identifies the minimum number of edits required to convert one token into another.
This is the syntax that is used:
q=field:term~N
where N is the edit distance. The value of N varies from 0 to 2.
If you do not specify anything for N, then a value of 2 is used as default.
N=2 -> This matches the highest number of edits.
N=0 -> This means no edit and would have same effect as term query.
You can give a fractional value between 0 and 1, but any fractional value greater than 1 will throw the following error:
org.apache.solr.search.SyntaxError: Fractional edit distances are not allowed!
Note, however, that a fractional value less than 1 also defaults to 2,
so q=field:term~0.2 will have the same effect as q=field:term~2
Any distance greater than 2 will also default to 2,
so the following query
q="field:value~30"
is the same as (you can verify this by looking at the debug query output)
q="field:value~2"
which will match the highest number of edits.
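To make the edit distance idea concrete, here is a small Python sketch of plain Levenshtein distance, plus a helper that clamps N to 2 as described above. The function names are made up for illustration, and this is not Lucene's implementation (which, among other things, may treat a transposition as a single edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def within_edits(term: str, candidate: str, n: int = 2) -> bool:
    """Hypothetical helper: values of n above 2 are clamped to 2,
    mirroring the behaviour described in the answer."""
    return levenshtein(term, candidate) <= min(n, 2)
```

With this, within_edits("value", "valve", 30) behaves exactly like within_edits("value", "valve", 2).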
Note:
The tilde in a fuzzy query is different from the tilde in a proximity query. In a proximity query the tilde is applied after the closing quotation mark,
e.g. the query below:
q=field:"foo bar"~30
So in your case, when you add quotes around the value
q="field:'value'~30"
it becomes a proximity search, which only really applies if you have two or more terms in the phrase. So it won't do much beyond finding docs which have "value" set in "field".
In your example it means nothing - but if there were multiple words in your query, i.e. "foo bar"~30, it would mean "find foo and bar within 30 positions of each other". It allows you to give a phrase match a margin in regard to how close each term has to be to each other.
The ^10 part is telling Lucene how much to weight the phrase match compared to other parts of the query.
From the Lucene Query Parser Syntax description:
Lucene supports finding words are a within a specific distance away. To do a proximity search use the tilde, "~", symbol at the end of a Phrase. For example to search for a "apache" and "jakarta" within 10 words of each other in a document use the search:
"jakarta apache"~10
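Since the meaning of ~ depends on whether it follows a bare term or a closing quote, a rough Python sketch that pulls a single clause apart may help. The regular expression and the describe_clause helper are illustrative assumptions and only handle simple field:term or field:"phrase" clauses with double quotes; Solr's real query parser is far more involved.

```python
import re

# Hypothetical pattern for one clause: field, then a term or a quoted
# phrase, then an optional ~value and an optional ^boost.
CLAUSE = re.compile(
    r'^(?P<field>\w+):'
    r'(?:"(?P<phrase>[^"]*)"|(?P<term>\w+))'
    r'(?:~(?P<tilde>\d*\.?\d*))?'
    r'(?:\^(?P<boost>\d+(?:\.\d+)?))?$'
)


def describe_clause(clause):
    """Report how the ~ and ^ suffixes of a simple clause would be read."""
    m = CLAUSE.match(clause)
    if m is None:
        raise ValueError("unsupported clause: " + clause)
    is_phrase = m.group("phrase") is not None
    tilde = m.group("tilde")
    if tilde is None:
        meaning = None
    else:
        # After a closing quote the tilde is a proximity (slop) value,
        # after a bare term it is a fuzzy edit distance.
        meaning = "proximity" if is_phrase else "fuzzy"
    boost = m.group("boost")
    return {
        "field": m.group("field"),
        "text": m.group("phrase") if is_phrase else m.group("term"),
        "tilde_meaning": meaning,
        "tilde_value": tilde,
        "boost": float(boost) if boost is not None else None,
    }
```

For example, describe_clause('field:"foo bar"~30^10') reports a proximity value of 30 and a boost of 10, while describe_clause('field:term~2') reports a fuzzy distance of 2.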

Apache Solr's bizarre search relevancy rankings

I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of the 4 results is a document containing only 2 of those words, multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct: it ranks documents with an algorithm called TF-IDF.
So, in your case, the order can be explained by the TF-IDF formula.
One possible solution is to ignore the TF-IDF score and count each matching term in the document as one, so that a document with 5 matches simply gets a score of 5, one with 4 matches gets 4, etc. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
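To see what that constant-score query amounts to, here is a toy Python sketch that scores each document by the number of distinct query terms it contains and ignores how often they occur. The whitespace tokenizer and the function name are assumptions for illustration; this runs outside Solr and is not how Solr computes the score internally.

```python
def rank_by_matched_terms(query_terms, documents):
    """Score each document as the count of distinct query terms present.

    documents maps a document id to its text. Returns (doc_id, score)
    pairs, highest score first.
    """
    wanted = {term.lower() for term in query_terms}
    scored = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        # Each matching term contributes exactly 1, however often it occurs.
        scored.append((doc_id, len(wanted & tokens)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

A document repeating two of the terms many times gets a score of 2, so it can no longer outrank a document containing all five.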
Another solution, which requires some scripting, goes like this. First, run a query that requires all 5 terms, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for every combination of 4 terms out of 5 (if I'm not mistaken, that requires 5 additional queries), and so on, down to queries with a single mandatory clause. You then have the full results, and you only need to normalise them, or just concatenate them if you decide that 5-matched docs are always better than 4-matched docs. Cons of this solution: a lot of queries, which need to be run programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of counting matched terms.
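The query generation for this second approach could be scripted roughly as follows. This Python sketch only builds the query strings, tier by tier; sending them to Solr and merging the results is left out, and tiered_queries is a made-up name.

```python
from itertools import combinations


def tiered_queries(terms):
    """Build mandatory-clause queries for every combination of terms,
    grouped into tiers from "all terms required" down to a single term."""
    tiers = []
    for size in range(len(terms), 0, -1):
        tier = [" ".join("+" + term for term in combo)
                for combo in combinations(terms, size)]
        tiers.append(tier)
    return tiers
```

For 5 terms this gives tiers of 1, 5, 10, 10 and 5 queries, 31 in total. A document matching all 5 terms also matches every smaller combination, so results from earlier tiers need to be de-duplicated before concatenating.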

In Apache Solr does position semantically mean the same thing as order?

In Apache Solr if I have two fields from two different documents:
field 1: "tom sawyer was a character in huckleberry finn"
field 2: "a character in huckleberry finn is tom sawyer"
*note that for simplicity the fields don't appear tokenized as shown here, but they are in the index
And if I search for "a character in huckleberry finn" (also tokenized), will field 2 score higher because not only are the tokens in the same order in the field as they are in the query, but the phrase is also at the beginning of both the field and the query?
No. The positions are not used for computing the score, except for the positions in relation to each other if you use a phrase query. In your example, they're the same - so the score should be identical.
Rather than posting a separate question for each similar case, it's probably better to refer to the Lucene Practical Scoring Formula, which shows how the score is actually calculated for the TFIDF similarity. Remember that the similarity calculation is pluggable, so if you're using a different similarity, the calculation will be different.
These things are also simple to test yourself: just index two documents with the text and issue a query with debugQuery set to true, and you'll see how each element contributes to the score.
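As a rough illustration of why the two fields tie, the Python sketch below finds where the phrase starts in each field and computes a toy score from term counts alone. toy_score is a deliberately simplified stand-in, not the Lucene Practical Scoring Formula; the point is only that nothing in it depends on the phrase's offset.

```python
from collections import Counter


def phrase_offsets(tokens, phrase):
    """Token positions at which phrase occurs as a contiguous run."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def toy_score(tokens, query):
    """Toy stand-in for a score: sum of query term counts in the field.
    The absolute position of each term is never consulted."""
    counts = Counter(tokens)
    return sum(counts[term] for term in query)


field1 = "tom sawyer was a character in huckleberry finn".split()
field2 = "a character in huckleberry finn is tom sawyer".split()
query = "a character in huckleberry finn".split()
```

Here the phrase starts at different offsets in field1 and field2, yet toy_score returns the same value for both.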

Solr Minimum match customization

I have a case wherein I would like to match like this:
Query: abcd efgh ijkl mnop
The query is then run through an NGram tokenizer, and each word is split up into 2-gram tokens.
E.g. the query is split up into
ab,bc,cd,ef,fg,gh,ij,jk,kl,mn,no,op
Now, while matching, I want the minimum match to be customized for tokens within words.
I mean: by default, with mm=1, an indexed document is returned when any one token corresponding to a word matches it. And if I give mm=2, then any one token from each of any 2 words needs to match for the indexed document to be returned.
But what I want is: return a document only when at least 'm' tokens match from each of mm words.
For example, I would want at least 2 tokens each from at least 3 words to match for the indexed document to be selected.
It seems Lucene's IndexSearcher does this core part. Do I need to change the code, or is there any config which would do the above?
Thanks in advance...
This isn't exactly what you're asking for, but I'm guessing your underlying question is "how can I ensure that fuzzy searches only return things which are 'close' to the original query?"
The syntax foo~.8 does this - see the docs. Basically, .8 is a measure of the edit (Levenshtein) distance divided by the length of the word.
If you want to stick to your idea of counting pairs which must match, you can do some math to figure out what the minimum Levenshtein distance needs to be.
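For reference, the matching rule the question asks for (at least m tokens from each of at least k words) can be written down outside Solr in a few lines of Python. This is only a sketch of the rule itself, with made-up function names, and says nothing about how to wire it into Solr's mm handling.

```python
def bigrams(word):
    """Set of 2-character tokens in a word, e.g. abcd -> ab, bc, cd."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def document_bigrams(text):
    """All 2-gram tokens produced by the words of an indexed document."""
    tokens = set()
    for word in text.split():
        tokens |= bigrams(word)
    return tokens


def custom_min_match(query, indexed_tokens, min_tokens_per_word, min_words):
    """True when at least min_words query words each have at least
    min_tokens_per_word of their 2-grams present in the document."""
    qualifying = 0
    for word in query.split():
        hits = len(bigrams(word) & indexed_tokens)
        if hits >= min_tokens_per_word:
            qualifying += 1
    return qualifying >= min_words
```

With the example from the question, custom_min_match("abcd efgh ijkl mnop", document_bigrams(text), 2, 3) accepts a document only if three of the four words contribute two or more matching tokens each.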

Query Term elimination

In the boolean retrieval model, a query consists of terms which are combined using different operators. Conjunction is the most obvious choice at first glance, but as the query gets longer, bad things happen: recall drops significantly when using conjunction, and precision drops when using disjunction (for example, stanford OR university).
For now we use conjunction in our search system (and the boolean retrieval model), and we have a problem if the user enters some very rare word or a long sequence of words. For example, if the user enters toyota corolla 4wd automatic 1995, we probably don't have a matching document. But if we delete at least one word from the query, we do have such documents. As far as I understand, in the Vector Space Model this problem is solved automatically: we do not filter documents on term presence, we rank documents using the presence of terms.
So I'm interested in more advanced ways of combining terms in the boolean retrieval model, and in methods of rare term elimination in the boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector whose components wi are, for example, 0 if the ith search term doesn't appear in the file and 1 if it does; or the number of times search term i appears in the file; etc. Then rank pages based on e.g. Manhattan distance, Euclidean distance, etc., sort so that the best matches come first, and possibly cull results that fall outside a specified match tolerance.
If you want to handle more complex queries, you can put the query into CNF - e.g. (term1 or term2 or ... termn) AND (item1 or item2 or ... itemk) AND ... and then redefine the weights wi accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
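A minimal Python sketch of the simplest variant (binary weights, Manhattan distance to the "all terms present" vector, and a list of the terms that failed to match) might look like this. The whitespace tokenizer, the function names and the max_missing tolerance are all assumptions made for illustration.

```python
def match_vector(query_terms, document):
    """Binary weights: 1 if the i-th query term occurs in the document."""
    tokens = set(document.lower().split())
    return [1 if term.lower() in tokens else 0 for term in query_terms]


def rank_with_missing_terms(query_terms, documents, max_missing=None):
    """Rank documents by Manhattan distance from the all-ones query vector.

    Returns (distance, doc_id, missing_terms) tuples, closest match first.
    max_missing is a hypothetical tolerance: documents missing more terms
    than that are dropped.
    """
    results = []
    for doc_id, text in documents.items():
        vector = match_vector(query_terms, text)
        distance = sum(1 - w for w in vector)
        missing = [t for t, w in zip(query_terms, vector) if w == 0]
        if max_missing is None or distance <= max_missing:
            results.append((distance, doc_id, missing))
    results.sort(key=lambda item: item[0])
    return results
```

For the toyota corolla 4wd automatic 1995 example, a document lacking only 4wd comes back with distance 1 and ["4wd"] as its missing terms, so the user can see how close the match is.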
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...
