Optimize Full-Text Search Across Multiple Tables - sql-server

I have a requirement to search several different tables in my SQL Server database, and I need to sort the results based on which table the match occurred in.
The approach I've taken is shown below. However, this doesn't seem very efficient as the amount of data grows.
Can anyone suggest any tricks to optimize this?
-- Full-text query
DECLARE @FtsQuery nvarchar(100)
SET @FtsQuery = 'FORMSOF(INFLECTIONAL, detail)'
-- Maximum characters in description column
DECLARE @MaxDescription int
SET @MaxDescription = 250
SELECT 1 AS RankGroup, FTS.Rank, Id, Title, LEFT([Description], @MaxDescription) AS Description FROM Table1
INNER JOIN CONTAINSTABLE(Table1, *, @FtsQuery) AS FTS ON FTS.[KEY] = Table1.Id
UNION SELECT 2, FTS.Rank, Id, Title, NULL FROM Table2
INNER JOIN CONTAINSTABLE(Table2, *, @FtsQuery) AS FTS ON FTS.[KEY] = Table2.Id
UNION SELECT 3, FTS.Rank, Id, Title, LEFT([Description], @MaxDescription) FROM Table3
INNER JOIN CONTAINSTABLE(Table3, *, @FtsQuery) AS FTS ON FTS.[KEY] = Table3.Id
UNION SELECT 4, FTS.Rank, Id, Title, LEFT([Description], @MaxDescription) FROM Table4
INNER JOIN CONTAINSTABLE(Table4, *, @FtsQuery) AS FTS ON FTS.[KEY] = Table4.Id
UNION SELECT 5, FTS.Rank, Id, Title, LEFT([Description], @MaxDescription) FROM Table5
INNER JOIN CONTAINSTABLE(Table5, *, @FtsQuery) AS FTS ON FTS.[KEY] = Table5.Id
ORDER BY RankGroup, Rank DESC
One idea I'd considered is to create an indexed view and then perform the search on the view. But since the view would need these UNIONs, it's hard to see how that would be any more efficient.

This is a difficult issue, because CONTAINSTABLE can only search a single table's FTS index at a time. Your UNION solution above is fine as long as your performance is acceptable.
We faced the same issue of needing to efficiently search many columns from many tables in a single query. What we did was aggregate all of the data from these columns and tables into a single read-only table. Our query then only needs a single CONTAINSTABLE call:
CONTAINSTABLE(AggregatedTable, AggregatedColumn, @FtsQuery)
We have a scheduled job that runs every 5-10 minutes and incrementally aggregates any modified content from our source table into our single read-only aggregated content table.
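A minimal sketch of that layout follows. All names here (SearchContent, SourceTable, SourceId, AggregatedColumn) are invented for illustration, not taken from our actual schema:
-- Denormalized, read-only table that every search hits;
-- a scheduled job keeps it in sync with the source tables
CREATE TABLE SearchContent (
    Id int IDENTITY(1,1) PRIMARY KEY,
    SourceTable varchar(50) NOT NULL,   -- e.g. 'Table1' .. 'Table5'
    SourceId int NOT NULL,              -- key of the row in its source table
    Title nvarchar(200) NULL,
    AggregatedColumn nvarchar(max) NULL -- concatenated searchable text
)
-- (create the full-text index on AggregatedColumn, then:)
DECLARE @FtsQuery nvarchar(100)
SET @FtsQuery = 'FORMSOF(INFLECTIONAL, detail)'
SELECT sc.SourceTable, sc.SourceId, sc.Title, FTS.[RANK]
FROM SearchContent sc
INNER JOIN CONTAINSTABLE(SearchContent, AggregatedColumn, @FtsQuery) AS FTS
    ON FTS.[KEY] = sc.Id
ORDER BY CASE sc.SourceTable
             WHEN 'Table1' THEN 1 WHEN 'Table2' THEN 2 WHEN 'Table3' THEN 3
             WHEN 'Table4' THEN 4 ELSE 5 END,
         FTS.[RANK] DESC
The CASE on SourceTable reproduces the RankGroup ordering from the UNION query in the question.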
In general it seems that using FTS in any reasonably sized database under real user load means you are always battling performance. If you find that no matter what you do you cannot get acceptable performance, you may need to investigate other technologies such as Lucene.

Related

Is it possible to perform a join in Access on a second column if the first is blank?

I have this ugly source data with two columns, let's call them EmpID and SomeCode. Generally EmpID maps to the EmployeeListing table. But sometimes, people are entering the Employee IDs in the SomeCode field.
The person previously running this report in Excel 'solved' this problem by performing multiple vlookups with if statements, as well as running some manual checks to ensure results were accurate. As I'm moving these files to Access I am not sure how best to handle this scenario.
Ideally, I'm hoping to tell my queries to do a Left Join on SomeCode if EmpID is null, and otherwise a Left Join on EmpID.
Unfortunately, there's no way for me to force validation or anything of the sort in the source data.
Here's the full SQL query I'm working on:
SELECT DDATransMaster.Fulfillment,
DDATransMaster.ConfirmationNumber,
DDATransMaster.PromotionCode,
DDATransMaster.DirectSellerNumber,
NZ([DDATransMaster]![DirectSellerNumber],[DDATransMaster]![PromotionCode]) AS EmpJoin,
EmployeeLookup.ID AS EmpLookup
FROM DDATransMaster
LEFT JOIN EmployeeLookup ON NZ([DDATransMaster]![DirectSellerNumber],[DDATransMaster]![PromotionCode]) = EmployeeLookup.[Employee #]
You can create a query like this:
SELECT
IIf(EmpID Is Null, SomeCode, EmpID) AS join_field,
field2,
etc
FROM YourTable
Or if the query will always be used within an Access session, Nz is more concise.
SELECT
Nz(EmpID, SomeCode) AS join_field,
field2,
etc
FROM YourTable
When you join that query to your other table, the Access query designer can represent the join between join_field and some matching field in the other table. If you were to attempt the IIf or Nz as part of the join's ON clause, the query designer couldn't display the join correctly in Design View; it could still work, but it may not be as convenient if you're new to Access SQL.
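For example, if you save the Nz version as a query named qryEmpJoin (a placeholder name for this sketch), the follow-up join is a plain equi-join that the designer can display:
SELECT q.join_field, el.ID AS EmpLookup
FROM qryEmpJoin AS q
LEFT JOIN EmployeeLookup AS el
ON q.join_field = el.[Employee #];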
See whether this SQL gives you what you want.
SELECT
dda.Fulfillment,
dda.ConfirmationNumber,
dda.PromotionCode,
dda.DirectSellerNumber,
NZ(dda.DirectSellerNumber,dda.PromotionCode) AS EmpJoin,
el.ID AS EmpLookup
FROM
DDATransMaster AS dda
LEFT JOIN EmployeeLookup AS el
ON NZ(dda.DirectSellerNumber,dda.PromotionCode) = el.[Employee #]
But I would use the Nz part in a subquery.
SELECT
sub.Fulfillment,
sub.ConfirmationNumber,
sub.PromotionCode,
sub.DirectSellerNumber,
sub.EmpJoin,
el.ID AS EmpLookup
FROM
(
SELECT
Fulfillment,
ConfirmationNumber,
PromotionCode,
DirectSellerNumber,
NZ(DirectSellerNumber,PromotionCode) AS EmpJoin
FROM DDATransMaster
) AS sub
LEFT JOIN EmployeeLookup AS el
ON sub.EmpJoin = el.[Employee #]
What about:
LEFT JOIN EmployeeListing ON NZ(EmpID, SomeCode) = EmployeeListing.[Employee #]
as your join? Nz() uses the second parameter if the first is null. I'm not 100% sure this sort of join works in Access, but it's worth 20 seconds to try.
Hope it works.
You could use a UNION:
SELECT DDATransMaster.Fulfillment,
DDATransMaster.ConfirmationNumber,
DDATransMaster.PromotionCode,
DDATransMaster.DirectSellerNumber,
EmployeeLookup.ID AS EmpLookup
FROM DDATransMaster
LEFT JOIN EmployeeLookup ON
DDATransMaster.DirectSellerNumber = EmployeeLookup.[Employee #]
where DDATransMaster.DirectSellerNumber IS NOT NULL
Union
SELECT DDATransMaster.Fulfillment,
DDATransMaster.ConfirmationNumber,
DDATransMaster.PromotionCode,
DDATransMaster.DirectSellerNumber,
EmployeeLookup.ID AS EmpLookup
FROM DDATransMaster
LEFT JOIN EmployeeLookup ON
DDATransMaster.PromotionCode = EmployeeLookup.[Employee #]
where DDATransMaster.DirectSellerNumber IS NULL;

TSQL query to merge data from multiple tables that may or may not have matching rows?

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.
What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.
To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.
As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:
select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:
select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.
Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:
select *
from (
select Username from t1
union
select Username from t2
union
-- ...
select Username from t10
) distinct_usernames
left join t1 on t1.Username = distinct_usernames.Username
left join t2 on t2.Username = distinct_usernames.Username
-- ...
left join t10 on t10.Username = distinct_usernames.Username
From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.
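A minimal sketch of that temp-table variant, assuming the same t1 through t10 naming (only the first two tables shown; repeat the pattern for the rest):
SELECT Username INTO #usernames FROM t1
UNION
SELECT Username FROM t2;
-- ... UNION in t3 through t10 the same way ...

CREATE UNIQUE CLUSTERED INDEX IX_usernames ON #usernames (Username);

SELECT *
FROM #usernames u
LEFT JOIN t1 ON t1.Username = u.Username
LEFT JOIN t2 ON t2.Username = u.Username;
-- ... through t10 ...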
A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.
Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.
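A rough sketch of that dynamic approach for one of the joined tables, reusing the #usernames temp table from the sketch above (STRING_AGG requires SQL Server 2017 or later; on older versions the FOR XML PATH trick does the same job):
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Every column of t1 except Username
SELECT @cols = STRING_AGG(QUOTENAME(name), ', ')
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.t1')
  AND name <> N'Username';

SET @sql = N'SELECT u.Username, ' + @cols +
           N' FROM #usernames AS u LEFT JOIN dbo.t1 AS t1 ON t1.Username = u.Username;';

EXEC sys.sp_executesql @sql;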

Performance Issues with Count(*) in SQL Server

I am having some performance issues with a query I am running in SQL Server 2008. I have the following query:
Query1:
SELECT GroupID, COUNT(*) AS TotalRows FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2
ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word) GROUP BY GroupID
Table1 contains about 500,000 rows. Table2 contains about 50,000, but will eventually contain millions. Playing around with the query, I found that rewriting it as follows reduces the execution time to under 1 second.
Query 2:
SELECT GroupID FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2 ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
What I do not understand is that this is a simple count query. If I execute the following query on Table1, it returns in under 1 second:
Query 3:
SELECT Count(*) FROM Table1
This query returns around 500,000 as the result.
However, the original query (Query 1) above returns a count of only 50,000 and takes 3 seconds to execute, even though simply removing the GROUP BY (Query 2) reduces the execution time to under 1 second.
I do not believe this is an indexing issue, as I already have indexes on the appropriate columns. Any help would be greatly appreciated.
Performing a simple COUNT(*) FROM table can do a much more efficient scan of the clustered index, since it doesn't have to care about any filtering, joining, grouping, etc. The queries that include full-text search predicates and mysterious subqueries have to do a lot more work. The count is not the most expensive part there; I bet they're still relatively slow if you leave the count out but keep the group by in, e.g.:
SELECT GroupID FROM Table1
INNER JOIN (
SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2 ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
GROUP BY GroupID;
Looking at the provided actual execution plan in the free SQL Sentry Plan Explorer*, two things stand out (the plan screenshots are omitted here), which lead me to believe you should:
Update the statistics on both Inventory and A001_Store_Inventory so that the optimizer can get a better rowcount estimate (which could lead to a better plan shape).
Ensure that Inventory.ItemNumber and A001_Store_Inventory.ItemNumber are the same data type to avoid an implicit conversion.
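For the first point, a minimal example of the statistics update (FULLSCAN is optional, but it gives the optimizer the most accurate estimates):
UPDATE STATISTICS dbo.Inventory WITH FULLSCAN;
UPDATE STATISTICS dbo.A001_Store_Inventory WITH FULLSCAN;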
(*) disclaimer: I work for SQL Sentry.
You should have a look at the query plan to see what SQL Server is doing to retrieve the data you requested. Also, I think it would be better to rewrite your original query as follows:
SELECT
Table1.GroupID -- When you use JOINs, it's always better to specify Table (or Alias) names
,COUNT(Table1.GroupID) AS TotalRows
FROM
Table1
INNER JOIN
Table2 ON
(Table2.Column1 = Table1.Column1) AND
(Table2.GroupID = @GroupID)
WHERE
CONTAINS(Table1.*, @Word)
GROUP BY
Table1.GroupID
Also, keep in mind that a simple COUNT and a COUNT with a JOIN and GROUP BY are not the same thing. In one case, it's just a matter of going through an index and making a count; in the other, there are other tables and grouping involved, which can be time-consuming depending on several factors.
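One way to see that difference for yourself is to run both shapes with I/O and timing statistics enabled and compare the output in the Messages tab (a diagnostic sketch, not part of the original answer; it assumes @GroupID and @Word are declared as in the question):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM Table1;  -- a narrow scan of a single index

SELECT Table1.GroupID, COUNT(Table1.GroupID) AS TotalRows
FROM Table1
INNER JOIN Table2
    ON (Table2.Column1 = Table1.Column1) AND
       (Table2.GroupID = @GroupID)
WHERE CONTAINS(Table1.*, @Word)
GROUP BY Table1.GroupID;      -- FTS lookup + join + aggregation

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;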

Most efficient query to get records from a table with IDs

I have a table whose purpose is to hold IDs.
I want to select many records from another table (a big table of millions of records).
Which one would perform better:
SELECT id, att1, att2
FROM myTable
WHERE id IN (SELECT id FROM #myTabwithIDS)
Or
SELECT id, att1, att2
FROM myTable t
INNER JOIN #myTabwithIDS t2
ON t2.id = t.id
I would use the Query Analyzer built into SQL Server to explore the execution plan.
http://www.sql-server-performance.com/2006/query-analyzer/
Specifically, turn on Show Execution Plan, and Statistics IO and Time.
Normally a join is better than a subquery, especially in your case, where the outer query's filter depends on the results of the subquery. See Subqueries vs joins for more details.
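A further option worth testing (not mentioned in the original answer) is EXISTS, which the optimizer often compiles to the same plan as the join, and which avoids duplicate output rows if #myTabwithIDS ever contains repeated IDs:
SELECT id, att1, att2
FROM myTable t
WHERE EXISTS (SELECT 1 FROM #myTabwithIDS t2 WHERE t2.id = t.id);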

Performance problem on a query

I have a performance problem on a query.
The first table is a Customer table, which has millions of records in it. It has an email address column and some other information about the customer.
The second table is a CommunicationInfo table, which contains just email addresses.
What I want to know is how many times each email address in the CommunicationInfo table repeats in the Customer table. What would be the most performant query?
The basic query that illustrates this situation is:
Select ci.Email, count(*)
from Customer c
left join CommunicationInfo ci on c.Email1 = ci.Email or c.Email2 = ci.Email
Group by ci.Email
But as written, it takes about 5 or 6 minutes to execute.
Thanks in Advance.
This query is about as good as it gets if you have indexes on Customer.Email1, Customer.Email2, and CommunicationInfo.Email:
Select
c.Email1, c.Email2, count(*)
from Customer c
left join CommunicationInfo ci on c.Email1 = ci.Email
left join CommunicationInfo ci2 on c.Email2 = ci2.Email
Group by c.Email1, c.Email2
You mention:
What I want to know is how many times
each email address in the
CommunicationInfo table repeats in the
Customer table. What would be the most
performant query?
To me, that sounds like you could easily use an INNER JOIN. This would most likely be a lot faster, since it limits the search scope to just those customers who really do have an e-mail: anyone who doesn't have an e-mail at all (and thus a count(*) = 0) will not even be looked at. That might make a big difference even just in the number of rows SQL Server has to count and group.
So try this:
SELECT
ci.Email, COUNT(*)
FROM
dbo.Customer c
INNER JOIN dbo.CommunicationInfo ci
ON c.Email1 = ci.Email OR c.Email2 = ci.Email
GROUP BY
ci.Email
How does that perform in your case?
Using the OR condition robs the optimizer of the opportunity to use a HASH JOIN or MERGE JOIN.
Use this:
SELECT q2.Email, SUM(cnt)
FROM (
SELECT ci.Email, COUNT(c.Email) AS cnt
FROM CommunicationInfo ci
LEFT JOIN
Customer c
ON c.Email1 = ci.Email
GROUP BY
ci.Email
UNION ALL
SELECT ci.Email, COUNT(c.Email) AS cnt
FROM CommunicationInfo ci
LEFT JOIN
Customer c
ON c.Email2 = ci.Email
GROUP BY
ci.Email
) q2
GROUP BY
q2.Email
or this:
SELECT ci.Email, COUNT(q.Email)
FROM CommunicationInfo ci
LEFT JOIN
(
SELECT Email1 AS email
FROM Customer c
UNION ALL
SELECT Email2
FROM Customer
) q
ON q.Email = ci.Email
GROUP BY
ci.Email
Make sure that you have indexes on Customer(Email1) and Customer(Email2).
The first query will be more efficient if your email columns are mostly empty; the second if they are mostly filled.
Depending on your environment there may not be much you can do to optimize this.
A couple of questions:
How many records are in CommunicationInfo?
How often do you really need to run this query? Is it a one-time analysis, or are multiple people going to be running this every 10 minutes?
Are the fields indexed? I'd guess that neither Email1 nor Email2 is indexed. However, I wouldn't suggest adding an index without taking the balance of the whole system into consideration.
Why are you using a left join? Do you really need EVERYTHING from the Customer table? You're counting, so there's no harm in doing an INNER JOIN.
Suggestions:
Run the query through the Database Engine Tuning Advisor (formerly the Index Tuning Wizard) to see if there is anything SQL Server would recommend.
An extreme suggestion would be to dump the Email1 and Email2 columns into a temp table and join to that. I've seen queries run slowly because of a large amount of stress on a particular table, so sometimes copying the records into a temp table is faster, but this technique is very dependent on how much memory there is, how fast IO is, and the amount of stress on a particular table.
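A rough sketch of that temp-table idea, under the same Email1/Email2 schema assumptions as the queries above:
-- Flatten both email columns into one indexed temp table
SELECT Email1 AS Email INTO #CustomerEmails
FROM Customer
WHERE Email1 IS NOT NULL
UNION ALL
SELECT Email2
FROM Customer
WHERE Email2 IS NOT NULL;

CREATE INDEX IX_CustomerEmails ON #CustomerEmails (Email);

SELECT ci.Email, COUNT(e.Email) AS Repeats
FROM CommunicationInfo ci
LEFT JOIN #CustomerEmails e ON e.Email = ci.Email
GROUP BY ci.Email;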
