Ranking of Full Text Search (SQL Server) - sql-server

For the last couple of hours I have been experimenting with all sorts of variations of SQL Server full text search. However, I am still unable to figure out how the ranking works. I have come across a couple of examples that really confuse me as to how some rows rank higher than others. For example:
I have a table with 5 full-text indexed columns, plus more that are not indexed. All are nvarchar fields.
I am running this query (well, almost; I retyped it with different names):
SET @SearchString = REPLACE(@Name, ' ', '*" OR "') -- Splits words with an OR between
SET @SearchString = '"' + @SearchString + '*"'
PRINT @SearchString;
SELECT ms.ID, ms.Lastname, ms.DateOfBirth, ms.Aka, KEY_TBL.RANK, ms.MiddleName, ms.Firstname
FROM View_MemberSearch AS ms
INNER JOIN CONTAINSTABLE(View_MemberSearch, (LastName, Firstname, MiddleName, Aka, DateOfBirth), @SearchString) AS KEY_TBL
ON ms.ID = KEY_TBL.[KEY]
WHERE KEY_TBL.RANK > 0
ORDER BY KEY_TBL.RANK DESC;
Thus if I search for 11/05/1964 JOHN JACKSON I would get "11/05/1964" OR "JOHN*" OR "JACKSON*" and these results:
ID | First Name | Middle Name | Last Name | AKA | Date of Birth | SQL Server RANK
----------------------------------------------------------------------------------
1 | DAVE | JOHN | MATHIS | NULL | 11/23/1965 | 192
2 | MARK | JACKSON | GREEN | NULL | 05/29/1998 | 192
3 | JOHN | NULL | JACKSON | NULL | 11/05/1964 | 176
4 | JOE | NULL | JACKSON | NULL | 10/04/1994 | 176
So, finally, my question: I don't see how rows 1 and 2 are ranked above row 3, or why row 3 is ranked the same as row 4. Row 3 should have the highest rank by far, seeing as the search string matches its first name and last name as well as its date of birth.
If I change the OR to AND I don't get any results.
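To make the string transformation easier to reason about, here is a small Python sketch that mirrors the two SET statements from the query. The function name `to_contains_query` is made up for illustration. One thing it shows: with the code as written, every term gets a trailing `*`, including the date.

```python
def to_contains_query(name: str) -> str:
    """Mirror the two T-SQL SET statements from the question (illustrative only)."""
    # REPLACE(@Name, ' ', '*" OR "'): each space closes one prefix term
    # and opens the next one.
    joined = name.replace(" ", '*" OR "')
    # Wrap the result so the first term is opened and the last one is closed.
    return '"' + joined + '*"'


if __name__ == "__main__":
    print(to_contains_query("11/05/1964 JOHN JACKSON"))
    # prints: "11/05/1964*" OR "JOHN*" OR "JACKSON*"
```

Note that a doubled space in the input would produce an empty `"*"` term with this approach, so the input probably wants trimming first.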

I've found that AND and OR clauses don't apply across columns. Create an indexed view that merges the columns and you'll get better results. Look at my past questions and you'll find information that suits your scenario.
I have also found I'm better off not appending a '*'. I thought it would turn up more matches, but it tended to return worse results (particularly for long words). As a middle ground, you might append a * only to longer words.
The example case you give is definitely weird.
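The "only append a * to longer words" idea could look roughly like the Python sketch below. The function name and the length threshold of 6 are assumptions picked for illustration; I have not measured what cutoff works best.

```python
def build_search_string(name: str, operator: str = "OR", min_prefix_len: int = 6) -> str:
    """Quote each word; add a trailing * only to words of at least
    min_prefix_len characters (the threshold is a hypothetical default)."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    terms = []
    for word in name.split():  # split() also collapses repeated spaces
        if len(word) >= min_prefix_len:
            terms.append(f'"{word}*"')  # prefix term
        else:
            terms.append(f'"{word}"')   # exact term
    return f" {operator} ".join(terms)


if __name__ == "__main__":
    print(build_search_string("JOHN JACKSON"))
    # prints: "JOHN" OR "JACKSON*"
```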

It's not entirely equivalent, but perhaps this question I asked (How-to: Ranking Search Results) could be of assistance?

What happens if you remove the DoB criteria?
MS Full-Text Search is really a black box that's hard to understand and customize.
You pretty much take it as is, unlike Lucene, which is great for customization.

Thank you guys.
Frank, you were correct that AND and OR do not go across columns; this was something I did not notice at first.
To get the best results I had to merge all 5 columns into 1 column in a view, then search on that single column. Doing so gave me the exact results I wanted without any extras.
My actual search string after converting it ended up being "Word1*" AND "Word2*".
Using the % sign still did not do what MSDN said it should do. If I searched for the word josh and it got changed into "Josh%", then "Joshua" would not be found. Pretty dumb; however, with "Josh*", Joshua would be found.

Related

SQL Server - Query String Greater Or Equal To

I am attempting to optimise a query in my application that is causing problems when scaling my application.
The table contains two columns: FROM and TO which each contain values. Here is an example:
Row | From | To
1 | AA | Z
2 | B | C
3 | JA | JZ
4 | JM | JZ
The query is passed a name (JOHN) and should return a list of ranges from the table that could contain the name.
select * from Ranges where [From] <= 'JOHN' and [To] >= 'JOHN'
Using the table above this would result in rows 1, 3 and 4 being returned (row 4 qualifies too, since 'JM' sorts before 'JOHN').
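As a sanity check on the range logic, here is a tiny Python model of the same predicate over the sample rows. It uses Python's plain code-point string comparison, which for upper-case ASCII gives the same order I would expect from a typical SQL Server collation, but that equivalence is an assumption. Under this comparison row 4 (JM to JZ) also contains JOHN, because 'JM' <= 'JOHN'.

```python
# Sample rows from the table above: (row id, From, To)
RANGES = [
    (1, "AA", "Z"),
    (2, "B", "C"),
    (3, "JA", "JZ"),
    (4, "JM", "JZ"),
]


def matching_rows(name, rows=RANGES):
    """Return ids of rows whose [From, To] range contains name."""
    return [row_id for row_id, low, high in rows if low <= name <= high]


if __name__ == "__main__":
    print(matching_rows("JOHN"))  # prints: [1, 3, 4]
    print(matching_rows("MARK"))  # prints: [1]
```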
The problem I am having is one of query consistency.
All indexes are in place but if I search for JOHN the query returns in 20 milliseconds, whereas MARK returns in 250 milliseconds.
Looking at query analyzer shows me that JOHN is actually searching for more rows than MARK but I'm struggling to understand how or why MARK takes so long.
If the time difference was 20 - 40 milliseconds, I could live with that but 250 is so large a difference that the overall performance of my application is terrible.
Does anybody have any idea how I could narrow down why I get such variance in my queries, or know a better way of storing and searching for string ranges (which could contain letters and numbers)?
Many thanks in advance.
EDIT - One thing I forgot to mention is that the original table contains approximately 15 million rows (it's actually postcodes).

SQL Server 2005 & Up - Concatenated Value based on Value in Joined Table

Prelude: The design of this database is truly horrible - this isn't the first "crooked" question I've asked, and it won't be the last. The question is what it is, and I'm only asking because A) I only have a couple of years of experience with SQL Server, and B) I've already been pounding my face on my keyboard for a couple of days trying to find a viable solution.
Having said all of that...
We have a database with two tables relevant to this issue. The schema is ridiculous, so I'm going to paraphrase so that it can be understood:
T_Customer & T_Task
T_Task holds data about various work that has been performed on behalf of a customer in T_Customer.
In T_Customer, there is a field called "Sort_Type" (again, paraphrasing...). In this field, there is a concatenated string of various fields from T_Task which, in the order specified, determine how the customer's report is produced in the client program. There are a total of 73 possible fields in T_Task that can be chosen as Sort_Type in T_Customer, and the user can choose up to 5 of them in any given order. For example:
T_Customer
Customer_ID | Sort_Type
------------|-------------------------------
1 | 'Task_Date,Task_Type,Task_ID'
2 | 'Task_Type,Destination'
3 | 'Route,Task_Type,Task_ID'
T_Task
Task_ID | Customer_ID | Task_Type | Task_Date | Route | Destination
--------|-------------|-----------|-----------|---------|-------------
12345 | 1 | 1 | 01/01/2017| '1 to 2'| '2'
12346 | 1 | 1 | 01/02/2017| '3 to 4'| '4'
12347 | 2 | 2 | 12/31/2016| '6 to 2'| '2'
12348 | 3 | 3 | 01/01/2017| '4 to 1'| '1'
In this example, Customer #1's report would be sorted/totaled by the Task_Date, then by Task_Type, then by Task_ID; but not simply by doing an ORDER BY. This function requires one single value which can be ordered as a whole, single unit. As such...
Up until today, a field existed in T_Task called (paraphrasing....) 'MySort'. This field contained a concatenated string of fixed-width values filled in with zeroes and created according to the order and content of the values in T_Customer.Sort_Type. In this case:
Task_ID | Customer_ID | Task_Type | Task_Date | Route | Destination | MySort
--------|-------------|-----------|-----------|-------|-------------|-------
12345 | 1 | 1 | 01/01/2017| 1 to 2| 2 |'002017010100000000010000012345'
12346 | 1 | 1 | 01/02/2017| 3 to 4| 4 |'002017010200000000010000012346'
12347 | 2 | 2 | 12/31/2016| 6 to 2| 2 |'000000000000000000020000000002'
12348 | 3 | 3 | 01/01/2017| 4 to 1| 1 |'000040to0100000000030000012348'
During the printing phase of every single report, the program would search for the customer, find the values in T_Customer.Sort_Type, split them by commas, and then run an update on all of the tasks of that customer to update the value of MySort accordingly...
Can you guess what the problem is with this? Performance (not to mention chronic insanity)
I have been tasked with finding a more efficient way of performing this same task server-side, within SQL Server 2005 if possible, using whatever means will eventually allow me to return a result set including all of the details of the tasks requested, together with a concatenated string similar to the one used in the past (which the client program relies upon in order to sort and subtotal the report).
I've tried Views, UDFs in computed columns, and parameterized queries, but I know my limitations. I'm too inexperienced to know all of my options.
Question: Aside from quitting (not an option) or going berserk (considering it...), what methods might you use to solve this problem?
EDIT: Having received two questions about the MySort column already,
I'll explain a bit better.
T_Task.MySort =
REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value1]))
+ CAST(T_Customer.Sort_Type[Value1] AS VARCHAR(10))
+
REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value2]))
+ CAST(T_Customer.Sort_Type[Value2] AS VARCHAR(10))
+
REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value3]))
+ CAST(T_Customer.Sort_Type[Value3] AS VARCHAR(10))
WHERE T_Customer.Customer_ID = T_Task.Customer_ID
...and so on, up to T_Customer.Sort_Type[Value5].
Reminder: Those values are not constants at all, so the value of the
field MySort had to constantly be updated before printing a report.
The idea is to somehow remove the need to constantly update the field,
and instead return the string as part of the result set.
The resulting string should always be 50 characters in length. I
didn't do that here simply to save a bit of space and time - I chose
only 3 for the example. The real string would simply have another
twenty zeroes leading the value:
'00000000000000000000002017010100000000010000012345'
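To pin down the rule being described, here is a rough Python model of how the key is assembled: each chosen field is left-padded with zeroes to 10 characters, the pieces are concatenated in Sort_Type order, and the result is left-padded to 50. The names `build_sort_key` and `format_value` are made up, and rendering dates as yyyymmdd is inferred from the sample values. Only numeric and date fields are covered; how text fields such as Route get encoded is not reproduced here.

```python
from datetime import date

FIELD_WIDTH = 10  # each sort field occupies 10 characters
MAX_FIELDS = 5    # up to 5 fields, so the full key is 50 characters


def format_value(value) -> str:
    """Render one field as a zero-padded, fixed-width string."""
    if isinstance(value, date):
        text = value.strftime("%Y%m%d")  # assumption: dates sort as yyyymmdd
    else:
        text = str(value)
    return text.rjust(FIELD_WIDTH, "0")


def build_sort_key(task: dict, sort_type: str) -> str:
    """Concatenate the fields named in sort_type, then left-pad to full width."""
    fields = [name.strip() for name in sort_type.split(",") if name.strip()]
    key = "".join(format_value(task[name]) for name in fields)
    return key.rjust(FIELD_WIDTH * MAX_FIELDS, "0")


if __name__ == "__main__":
    task = {"Task_ID": 12345, "Task_Type": 1, "Task_Date": date(2017, 1, 1)}
    print(build_sort_key(task, "Task_Date,Task_Type,Task_ID"))
```

The point of the sketch is that the key depends only on the task row and the customer's Sort_Type, so in principle it can be computed at read time instead of being stored and rewritten before every report.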

Optimal View Design To Find Mismatches Between Two Sets of Data

A bit of background...my company utilizes a piece of software that stores information about a mortgage loan in independent fields. These fields are broken up across many tables in the loan database.
My current dilemma revolves around designing a view(s) that will allow me to find mismatched data on a subset of loans from the underwriting side of our software and the lock side of our software.
Here is a quick example of the data returned from the two views that already exist:
UW View
transID | DTIField | LTVField | MIField
50000 | 37.5 | 85.0 | 1
Lock View
transID | DTIField | LTVField | MIField
50000 | 42.0 | 85.0 | 0
In the above situation, the view should return the fields that are not matching (in this case the DTIField and the MIField). I have already built a comparison view that uses a series of CASE statements to return either a 0 for not matched or a 1 for matched:
transID | DTIField | LTVField | MIField
50000 | 0 | 1 | 0
This is fine in itself but it is creating a bit of an issue downstream on the reporting side. We want to be able to build a report that would display only those transIDs that have mismatched data and show which columns are not matched. Crystal Reports is the reporting solution in question.
Some specifics about the data sets...we have 27 items of the loan that we are comparing (so a total 54 fields). There are over 4000 loans in the system and growing. There are already indexes on the transID fields.
How would you structure the view to return all the data needed for the report? We can do a good amount of work in Crystal Reports but ideally much of the logic would be handled in MSSQL.
Thanks for any assistance.
I don't think there should be any issue in comparing the 27 columns for a given row. Since you'll be reading each row just once and comparing its columns across both views, it shouldn't really pose any performance problems. You could also use a hash function such as HASHBYTES to assign a hash value to the combination of these 27 fields on each side, and then compare the hashes to decide which rows the view should return. This may give some performance improvement; testing will reveal more.
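For the reporting side, one shape that may be easier to consume than a row of 0/1 flags is one output row per mismatched field. The Python sketch below only models that shape using the sample data from the question; the function names are hypothetical, and in practice the same idea would be expressed in the view itself.

```python
def find_mismatches(uw_row, lock_row, key="transID"):
    """Return the names of fields whose values differ between the two rows."""
    return [field for field in uw_row
            if field != key and uw_row[field] != lock_row.get(field)]


def mismatch_report(uw_rows, lock_rows, key="transID"):
    """One tuple per mismatched field: (transID, field, UW value, lock value)."""
    lock_by_id = {row[key]: row for row in lock_rows}
    report = []
    for uw in uw_rows:
        lock = lock_by_id.get(uw[key])
        if lock is None:
            continue  # no lock-side row for this loan; skipped in this sketch
        for field in find_mismatches(uw, lock, key):
            report.append((uw[key], field, uw[field], lock.get(field)))
    return report


if __name__ == "__main__":
    uw = [{"transID": 50000, "DTIField": 37.5, "LTVField": 85.0, "MIField": 1}]
    lock = [{"transID": 50000, "DTIField": 42.0, "LTVField": 85.0, "MIField": 0}]
    for row in mismatch_report(uw, lock):
        print(row)
```

Loans with no mismatches produce no rows at all, which gives the report its "only transIDs with mismatched data" filter for free.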

Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates a score for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but there are two of them, while 2.com/hello1.html has really high relevance and is the only result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
If this isn't what you want, your best options are to provide a different sort parameter or to post-process the results. For example, on one project I was involved in (where we had a very modest number of groups), we chose to pull the top three results for each group and, in post-processing, calculated our own sort order for the groups based on a combination of their scores and numFound values. This sort of strategy might be prohibitive with too many groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.
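To illustrate the "score of the top document within each group" rule, here is a small Python model. It is not Solr code, and the scores are invented purely for the example.

```python
from collections import defaultdict


def group_by_top_score(docs):
    """Group docs by domain, sort each group by score (descending), then order
    the groups by the score of their best document."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["domain"]].append(doc)
    for members in groups.values():
        members.sort(key=lambda d: d["score"], reverse=True)
    # After the inner sort, members[0] is the top document of each group.
    return sorted(groups.items(), key=lambda item: item[1][0]["score"], reverse=True)


if __name__ == "__main__":
    docs = [  # hypothetical scores
        {"domain": "1.com", "path": "/hello1.html", "score": 0.4},
        {"domain": "1.com", "path": "/hello2.html", "score": 0.2},
        {"domain": "1.com", "path": "/hello3.html", "score": 0.1},
        {"domain": "2.com", "path": "/hello1.html", "score": 0.9},
    ]
    for domain, members in group_by_top_score(docs):
        print(domain, [d["path"] for d in members])
```

With these made-up scores, 2.com comes first even though it has a single document, because only the best score in each group counts.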

Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?

I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store data somewhere, then query table B (joining on table A in order to use same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A.
This option requires SQL Server to perform 2 searches through table A, however the resulting 2 data sets contain no duplicate data.
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option only requires one search of table A but the resulting data set will contain many duplicated values.
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
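Whichever option is used, the client ends up building the same nested structure, so the difference is mostly about how much data travels and how often table A is searched. Here is a Python sketch of the client-side work for each option, using the sample rows above. The function names are made up, and the row dictionaries stand in for whatever the data access layer actually returns.

```python
def stitch(events, properties):
    """Option 1: two result sets; attach table B rows to their table A event."""
    by_id = {e["Id"]: {**e, "Properties": {}} for e in events}
    for p in properties:
        event = by_id.get(p["EventId"])
        if event is not None:
            event["Properties"][p["Property"]] = p["Value"]
    return list(by_id.values())


def collapse_join(rows):
    """Option 2: one LEFT JOIN result set; fold repeated event columns together."""
    by_id = {}
    for r in rows:
        event = by_id.setdefault(r["Id"], {
            "Id": r["Id"], "TimeStamp": r["TimeStamp"],
            "EventTypeId": r["EventTypeId"], "Properties": {},
        })
        if r["Property"] is not None:  # NULL means the event has no properties
            event["Properties"][r["Property"]] = r["Value"]
    return list(by_id.values())


if __name__ == "__main__":
    events = [
        {"Id": 1, "TimeStamp": "10:26...", "EventTypeId": 12},
        {"Id": 2, "TimeStamp": "11:31...", "EventTypeId": 13},
    ]
    properties = [
        {"EventId": 1, "Property": 1, "Value": "dog"},
        {"EventId": 1, "Property": 2, "Value": "cat"},
    ]
    print(stitch(events, properties))
```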
Example:
Select E.Id, E.Name from Employee E join Dept D on E.DeptId = D.Id
and a subquery, something like this:
Select E.Id, E.Name from Employee E Where E.DeptId in (Select Id from Dept)
When I consider performance, which of the two queries would be faster, and why?
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of conditions separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!