If fuzzy set A={(10,0.2),(20,0.4),(25,0.1),(5,0.8)} and fuzzy set B={(10,0.4),(20,0.3),(25,0.6)} what should be the answer for union and intersection?
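One common reading, assuming the standard (Zadeh) definitions in which union takes the maximum membership and intersection the minimum, and assuming an element missing from a set has membership 0, can be sketched in Python. The function names are illustrative only.

```python
def fuzzy_union(a, b):
    """Zadeh union: membership is the max over both sets (missing = 0)."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}


def fuzzy_intersection(a, b):
    """Zadeh intersection: membership is the min over both sets (missing = 0)."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}


# Sets from the question, stored as element -> membership degree.
A = {10: 0.2, 20: 0.4, 25: 0.1, 5: 0.8}
B = {10: 0.4, 20: 0.3, 25: 0.6}

union = fuzzy_union(A, B)
intersection = fuzzy_intersection(A, B)
print(sorted(union.items()))         # [(5, 0.8), (10, 0.4), (20, 0.4), (25, 0.6)]
print(sorted(intersection.items()))  # [(5, 0.0), (10, 0.2), (20, 0.3), (25, 0.1)]
```

Under this convention element 5 effectively drops out of the intersection, because its membership in B is taken to be 0. Other union/intersection operators (for example product-based ones) would give different numbers.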
Related
I am implementing Solr fuzzy search using the complex phrase query parser.
But I am facing a weird case:
q={!complexphrase}name:"woo~1 grou~2" returns "wood group" as a result.
q={!complexphrase}name:"woo~1 gro~2" does not return "wood group",
although the distance between gro and group is 2!
Searching with this query:
q={!complexphrase}name:"Anderso~1 Interes~2" returns 'Anderson Interests'.
The distance between Interes and Interests is the same as between gro and group!
Any idea what the reason is?
I believe you are running into a problem with query rewrites.
Any multi-term query (fuzzy queries, prefix queries, etc.) gets expanded, in Lucene, into the exact terms that it matches. There is a maximum to the number of terms that can be generated this way though, so when rewriting the query, it will just try to pick the best within that limit. I suspect there are just too many matches for gro~2.
Perhaps you'll find it odd that there are so many matches that it can't incorporate all of them into the query. It looks like you are trying to search for words beginning with gro, with up to two more letters tacked onto the end. How many could there be? But that isn't what you're searching for. Fuzzy queries are based on Levenshtein distance. The matches for that term include:
g__ -- Three-letter words beginning with g
_r_ -- Three-letter words with an r in the middle
__o -- Three-letter words with an o on the end
gr__ -- Any four-letter word beginning with gr
etc.
In short, it could match a massive list of terms, and as far as the similarity algorithm is concerned, "arm" and "cron" match just as well as "group".
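To make the point about edit distance concrete, here is a minimal Python sketch of plain Levenshtein distance. It is an illustration only: the function is written from scratch and is not Lucene's implementation (which, as far as I know, may also count transpositions as a single edit).

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Every one of these terms is within distance 2 of "gro".
for candidate in ["group", "arm", "cron", "grou"]:
    print(candidate, levenshtein("gro", candidate))
```

"group", "arm" and "cron" all come out at distance 2, so a gro~2 query has no reason to prefer "group" over the others when the expansion limit forces it to choose.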
If you really just want to match terms that start with "gro", use a prefix query instead: "woo* gro*".
If you actually want to search with a fuzzy query, including the list of possible matches seen above, you can raise maxBooleanClauses in the query section of your solrconfig.xml:
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
</query>
I have this:
SELECT * FROM AwesomePeople WHERE CONTAINS(Name, 'NEAR(("Nathan", "Fillion"), MAX, TRUE)')
But I want to combine it so it uses my thesaurus of words to look at alternatives for Nathan and Fillion.
I can do this:
SELECT * FROM AwesomePeople WHERE CONTAINS(Name, 'FORMSOF (THESAURUS, "Nathan")')
But I don't know how to search for 2 words, or make it do FORMSOF and NEAR together in a single query. I have tried a few combinations but am out of luck.
Any ideas?
It looks like you are using SQL Server 2012, since NEAR(("Nathan", "Fillion"), MAX, TRUE) is the newer form of proximity search, called custom proximity search.
From technet:
http://technet.microsoft.com/en-us/library/ms142568%28v=sql.110%29.aspx
You cannot combine a custom proximity term with a generic proximity
term (term1 NEAR term2), a generation term (ISABOUT …), or a weighted
term (FORMSOF …).
and also lower down
You cannot combine a generic proximity term with a custom proximity
term, such as NEAR((term1,term2),5), a weighted term (ISABOUT …), or a
generational term (FORMSOF …).
TechNet seems to have ISABOUT and FORMSOF mixed up in the first quote, but either way, neither an ISABOUT nor a FORMSOF term can be combined with a NEAR term.
The following works well and is more powerful. It wraps both FORMSOF terms in ISABOUT and joins against CONTAINSTABLE:
SELECT *
FROM AwesomePeople AS C
INNER JOIN CONTAINSTABLE(AwesomePeople, Name, 'ISABOUT (
    FORMSOF(THESAURUS, "Nathan"),
    FORMSOF(THESAURUS, "Fillion"))') AS K
ON C.ID = K.[KEY];
I am using edismax ranking in Solr 4.1. I have a query parser that generates a few sub-queries from a single query string. Looking at the ranking details (by adding "debugQuery=on"), I found the following:
1> It looks like the scores of all sub-queries are added up first.
2> The total is then multiplied by the coord factor. The coord factor appears to be the fraction of sub-queries that matched. For example, if a query turns into 3 sub-queries and only 1 of them matches, the coord factor is 1/3.
I am wondering 1> whether my observation is correct, and 2> if so, whether there is a way to change this behaviour as follows:
1> Instead of summing the scores of the sub-queries, just take the max score.
2> Ignore the coord factor.
If the current Solr 4.1 implementation doesn't support this, any pointer to the source code to change or use as a reference would be great.
Check the parameters that control this behaviour:
Tie Breaker -
A value of "0.0" makes the query a pure "disjunction max query" --
only the maximum scoring sub query contributes to the final score. A
value of "1.0" makes the query a pure "disjunction sum query" where it
doesn't matter what the maximum scoring sub query is, the final score
is the sum of the sub scores. Typically a low value (ie: 0.1) is
useful.
Coord -
In Solr 1.4 and prior, you should basically set mm=0 if you want the
equivalent of q.op=OR, and mm=100% if you want the equivalent of
q.op=AND. In 3.x and trunk the default value of mm is dictated by the
q.op param (q.op=AND => mm=100%; q.op=OR => mm=0%). Keep in mind the
default operator is affected by your schema.xml entry. In older versions of Solr the default
value is 100% (all clauses must match)
Remove the mm parameter to drop the coord calculation, and set tie to 0 so that only the maximum sub-query score counts.
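The effect of the tie parameter described in the quote can be sketched in a few lines of Python. This is a simplified model of a disjunction-max combination (maximum score plus tie times the sum of the remaining scores), not Solr's actual code, and the sub-query scores are made up.

```python
def dismax_score(sub_scores, tie=0.0):
    """Combine sub-query scores as max + tie * (sum of the other scores)."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    return best + tie * (sum(sub_scores) - best)


scores = [4.0, 2.0, 1.0]  # hypothetical scores of three matching sub-queries
print(dismax_score(scores, tie=0.0))  # 4.0 -> pure max
print(dismax_score(scores, tie=1.0))  # 7.0 -> pure sum
print(dismax_score(scores, tie=0.5))  # 5.5 -> in between
```

With tie=0.0 only the best sub-query contributes, which matches the "take the max score" behaviour asked for in the question.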
Using Sphinx I can rank documents any way I want:
SELECT *
FROM someIndex
WHERE MATCH('foo bar')
OPTION ranker=expr('<any rank expression>')
How can I achieve the same behavior with Solr? Is {!boost q=<some_boost_expression>} the only way? For example, I need documents with more words to get a higher score:
A: foo bar blah blah blah
B: foo bar
I need A to be more relevant for the foo bar query. Right now B has the higher score.
You can apply boost functions (the bf parameter) to customize your scoring in a more complex way than a simple query term boost. This is available in the DisMax query parser and, as you might expect, is further extended in the Extended DisMax query parser.
The norm is where you would normally expect to find information about the length of the field, although it is combined with any field-level boost, and your logic (weighting longer fields more heavily) is the reverse of the default scoring. That makes it difficult to support both field boosts and that logic unless you create a custom Similarity. Norms, by the way, are stored at index time, not calculated at query time, if you decide to take that route.
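Why B currently beats A can be illustrated with a small Python sketch. It assumes the classic TF-IDF style length norm of field boost divided by the square root of the number of terms, and it ignores the lossy encoding Lucene applies when storing norms. The function name is illustrative.

```python
import math


def length_norm(num_terms, field_boost=1.0):
    """Classic Lucene-style norm: field boost times 1/sqrt(number of terms)."""
    return field_boost / math.sqrt(num_terms)


doc_a = "foo bar blah blah blah".split()  # 5 terms
doc_b = "foo bar".split()                 # 2 terms

# The shorter field gets the larger norm, so B outranks A by default.
print(length_norm(len(doc_a)))
print(length_norm(len(doc_b)))
```

A custom Similarity that favours longer fields would have to invert this relationship, which is why it interferes with field-level boosts stored in the same norm.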
There is a stored procedure that uses FREETEXTTABLE twice, on two tables, then merges the results and returns the top 50.
The problem is that if I search for "Women of Brewster", the results put "Confession of an ex doofus motha" first, with a rank of 143 from table A, and "Women of Brewster Place" second, with a rank of 102 from table B.
Is this because of the count? (Table A returns 2399 results in total; table B returns 3445.)
The short answer:
Freetext ranking is based on the OKAPI
BM25 ranking formula. Each term in the
query is ranked, and the values are
summed. Freetext queries will add
words to the query via inflectional
generation (stemmed forms of the
original query terms); these words are
treated as separate terms with no
special weighting or relationship with
the words from which they were
generated. Synonyms generated from the
Thesaurus feature are treated as
separate, equally weighted terms.
The much longer, and far more complicated answer can be found on Microsoft's site, of course. For advanced mathematics, click here.
1) The noise file was limited to a few characters, meaning that the word "of" is now considered important.
2) The result counts of the two tables do matter, since the smaller table will most likely be given a better weight value. This skews the rank higher for the smaller table.
Josef's link to MSDN was great at figuring out how it computes the rank value.
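To see why ranks from two separate full-text indexes are not directly comparable, here is a rough Python sketch of the textbook BM25 formula. It is not SQL Server's exact implementation and will not reproduce its RANK values. All term statistics are made up, and the document counts merely reuse the two totals mentioned in the question.

```python
import math


def bm25_term(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document's score."""
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf


def bm25(term_stats, doc_len, avg_doc_len, n_docs):
    """Sum per-term contributions; term_stats is a list of (tf, docs_with_term)."""
    return sum(bm25_term(tf, doc_len, avg_doc_len, n_docs, df)
               for tf, df in term_stats)


# Same document length and term frequencies, scored against two hypothetical indexes.
terms_in_a = [(1, 50), (1, 2000)]   # (tf, document frequency) in index A
terms_in_b = [(1, 400), (1, 3000)]  # the same terms are more common in index B
score_a = bm25(terms_in_a, doc_len=5, avg_doc_len=5, n_docs=2399)
score_b = bm25(terms_in_b, doc_len=5, avg_doc_len=5, n_docs=3445)
print(score_a, score_b)
```

The identical match scores higher in index A simply because its terms are rarer there, so merging the two result sets by raw rank can put a weaker match first.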
USE AdventureWorks2012;
GO
SELECT FT_TBL.Description
,KEY_TBL.RANK
FROM Production.ProductDescription AS FT_TBL
INNER JOIN FREETEXTTABLE(Production.ProductDescription,
Description,
'high level of performance') AS KEY_TBL
ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY]
ORDER BY RANK DESC;
GO
Use this INNER JOIN approach to get the relevant results in sorted order.
Reference: Azure SQL FREETEXTTABLE