I have two tables
Product (id int, brandid int, Name nvarchar(1000)), 2+ million rows
Brand (id int, name nvarchar(1000)), 20k rows
A full-text index is enabled on both tables' name fields.
If I do a search such as
SELECT count(*)
FROM Product p join Brand b
on p.BrandID = b.ID
WHERE contains(b.name,'calvin')
it runs super fast (less than a second). Same result if run against the p.Name field.
But this query
SELECT count(*)
FROM Product p join Brand b
on p.BrandID = b.ID
WHERE contains(b.name,'calvin')
OR
contains(p.Name,'calvin')
takes over a minute (several minutes, actually). If the OR is changed to AND, it is also super fast.
I cannot use UNIONs or CONTAINSTABLE since I'm using NHibernate.
I recommend you read up on SQL Server 2005 Full-Text Queries on Large Catalogs: Lessons Learned.
Any query with OR is a possible source of performance problems. An OR between columns from the two tables of a join is practically an invitation for disaster, since the optimizer has basically lost any information about how to optimize the join. Throw in a full-text condition, which makes cardinality prediction (the number of rows that match the condition) a wild guess at best, and you have got yourself a perfect storm.
Get rid of the OR. Period. Modify your requirements, discard the middle man (ORM layers).
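For illustration only, here is a minimal sketch (reusing the table and column names from the question) of what dropping the OR could look like once the ORM is out of the way; each branch drives its own full-text predicate, and the UNION removes duplicates, assuming Product.id is unique:

-- sketch: replace the OR with a UNION of the two fast queries
SELECT count(*)
FROM
(
    -- products whose brand name matches
    SELECT p.id
    FROM Product p join Brand b on p.BrandID = b.ID
    WHERE contains(b.name, 'calvin')

    UNION  -- de-duplicates products matched by both branches

    -- products whose own name matches
    SELECT p.id
    FROM Product p join Brand b on p.BrandID = b.ID
    WHERE contains(p.Name, 'calvin')
) AS matches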
Related
My application accesses data from a table in SQL Server. Consider the table name to be PurchaseDetail, with some other columns.
The select query has the below where clauses.
1. name - name has 10000 values only.
2. createdDateTime
The actual query is
select *
from PurchaseDetail
where name in (~2000 names)
and createdDateTime = 'someDateValue';
The SQL Tuning Advisor gave some recommendations. I tried the recommended indexes. The performance improved a bit, but not enough.
Is there anything wrong with my query? Or is it possible to change/improve my select query?
I haven't used IN in a where clause before. My table has more than 100 million records.
Any suggestions, please?
In this case, using IN for that much data is not good at all.
The best way is to use an INNER JOIN instead.
It would be better to insert those names into a temp table and INNER JOIN it with your SELECT query.
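A minimal sketch of that approach, assuming name is a varchar column (adjust the type and length to your schema) and using placeholder values for the ~2000 names:

-- stage the ~2000 names in a temp table (column type/length are assumptions)
CREATE TABLE #names (name varchar(200) NOT NULL PRIMARY KEY);

INSERT INTO #names (name)
VALUES ('name1'), ('name2'); -- ... the rest of the ~2000 names

-- join instead of IN, so the optimizer gets a real row count to work with
SELECT pd.*
FROM PurchaseDetail AS pd
INNER JOIN #names AS n ON n.name = pd.name
WHERE pd.createdDateTime = 'someDateValue';

DROP TABLE #names;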
I have two very large tables (TableA and TableB), both with an Id column, and I would like to remove all rows from TableA whose Ids are present in TableB. Which would be the fastest? Why?
--ISO-compatible
DELETE FROM TableA
WHERE Id IN (SELECT Id FROM TableB)
or
-- T-SQL
DELETE A FROM TableA AS A
INNER JOIN TableB AS B
ON A.Id = B.Id
If there are indexes on each Id, they should perform equally well.
If there are no indexes on each Id, exists() or in () may perform better.
In general I prefer exists() over in () because it allows you to easily add more than one comparison when needed.
delete a
from tableA as a
where exists (
select 1
from tableB as b
where a.Id = b.Id
)
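For example, adding a second comparison to the exists() form is trivial (the Status column here is purely hypothetical, not from the question):

delete a
from tableA as a
where exists (
    select 1
    from tableB as b
    where a.Id = b.Id
      and a.Status = b.Status -- hypothetical extra comparison
)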
Reference:
in vs inner join - Gail Shaw
exists() vs in - Gail Shaw
As long as your Id in TableB is unique, both queries should create the same execution plan. Just include the execution plan with each query and verify it.
Take a look at this nice post: in-vs-join-vs-exists
There's an easy way to find out, using the execution plan (press Ctrl+L in SSMS).
Since we don't know the data model behind your tables (the eventual indexes etc), we can't know for sure which query will be the fastest.
From experience, I can tell you that, for very large tables (> 1 million rows), DELETE is quite slow because of all the logging. Depending on the operation you're doing, you will want SQL Server NOT to log the delete.
You might want to check this question:
How to delete large data of table in SQL without log?
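One common pattern for keeping the log impact manageable (a sketch, not necessarily what the linked answer recommends) is to delete in small batches so each transaction stays short:

-- delete in batches so each transaction, and its log usage, stays small
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000) A
    FROM TableA AS A
    INNER JOIN TableB AS B ON A.Id = B.Id;

    SET @rows = @@ROWCOUNT;
END;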
We have a stored procedure that searches products based on a number of input parameters that differ from one scenario to the next. Depending on the input parameters the search involves anywhere from two to about a dozen different tables. In order to avoid unnecessary joins we build the actual search query as dynamic SQL and execute it inside the stored procedure.
In one of the most basic scenarios the user searches products by a keyword alone (see Query 1 below), which usually takes less than a second. However, if they search by a keyword and department (Query 2 below), the execution time goes up to well over a minute, and the execution plan looks somewhat different (the attached snapshots of the plans show just the parts that differ).
Query 1 (fast)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (1=1)
AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
Query 2 (slow)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (1=1)
AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
AND (Product.SourceID = @SourceID)
AND (Product.DepartmentID = @DepartmentID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
Both the Product and ProductVariant tables have some string columns that participate in the full-text index. The Product table has a non-clustered index on the SourceID column and another non-clustered index on SourceID+DepartmentID (this redundancy is not an oversight but is intended). ProductVariant.ProductID is a FK to Product and has a non-clustered index on it. Statistics are updated for all indexes and columns, and no missing indexes are reported by SQL Server Management Studio.
Any suggestions on what might be causing this drastically different performance?
P.S. Forgot to mention that Product.DepartmentID is a FK to a table of departments, in case it makes any difference.
Thanks to @MartinSmith for the suggestion to break the full-text search logic out into temp tables and then use them to filter the results of the main query. The following returns in just 2 seconds:
SELECT
[Key] AS ProductID
INTO
#matchingProducts
FROM
CONTAINSTABLE(Product, *, @Keywords)
SELECT
[Key] AS VariantID
INTO
#matchingVariants
FROM
CONTAINSTABLE(ProductVariant, *, @Keywords)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
LEFT OUTER JOIN #matchingProducts ON #matchingProducts.ProductID = Product.ProductID
LEFT OUTER JOIN #matchingVariants ON #matchingVariants.VariantID = ProductVariant.VariantID
WHERE (1=1)
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
AND (Product.DepartmentID = @DepartmentID)
AND (NOT #matchingProducts.ProductID IS NULL OR NOT #matchingVariants.VariantID IS NULL)
Curiously, when I tried to simplify the above solution using nested queries as shown below, the results were somewhere in-between in terms of speed (around 25 secs). Theoretically, the query below should be identical to the one above, yet somehow SQL Server internally compiles the second one differently.
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
LEFT OUTER JOIN
(
SELECT
[Key] AS ProductID
FROM
CONTAINSTABLE(Product, *, @Keywords)
) MatchingProducts
ON MatchingProducts.ProductID = Product.ProductID
LEFT OUTER JOIN
(
SELECT
[Key] AS VariantID
FROM
CONTAINSTABLE(ProductVariant, *, @Keywords)
) MatchingVariants
ON MatchingVariants.VariantID = ProductVariant.VariantID
WHERE (1=1)
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
AND (Product.DepartmentID = @DepartmentID)
AND (NOT MatchingProducts.ProductID IS NULL OR NOT MatchingVariants.VariantID IS NULL)
This may have been your mistake
In order to avoid unnecessary joins we build the actual search query as dynamic SQL and execute it inside the stored procedure.
Dynamic SQL cannot be optimized by the server in most cases. There are certain techniques to mitigate this; read more in The Curse and Blessings of Dynamic SQL.
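One such technique is building the dynamic string with sp_executesql parameters rather than concatenating literal values, so plans can be cached and reused; a rough sketch, with assumed parameter types:

-- sketch only: parameterized dynamic SQL (parameter types are assumptions)
DECLARE @sql nvarchar(max) = N'
SELECT DISTINCT Product.ProductID, Product.Title
FROM Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
  AND (Product.SourceID = @SourceID)
  AND (Product.ProductStatus = @ProductStatus)';

EXEC sp_executesql @sql,
    N'@Keywords nvarchar(4000), @SourceID int, @ProductStatus int',
    @Keywords = @Keywords, @SourceID = @SourceID, @ProductStatus = @ProductStatus;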
Get rid of your dynamic SQL and build a decent query using proper indices. I assure you: SQL Server knows better than you when it comes to optimizing. Define ten different queries if you must (or a hundred).
Secondly... why would you ever expect the same execution plan when running different queries, using different columns/indices? The execution plans and results you get seem perfectly natural to me, given your approach.
You will not get the same execution plan / performance because you are not querying the DepartmentID column in the first query.
We have had an issue since a recent update on our database (I made this update, I am guilty here): one of the queries used is much slower since then. I tried to modify the query to get faster results, and managed to achieve my goal with temp tables, which is not bad, but I fail to understand why this solution performs better than a CTE-based one that does the same queries. Maybe it has to do with some tables being in a different DB?
Here's the query that performs badly (22 minutes on our hardware):
WITH CTE_Patterns AS (
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK) ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
),
CTE_Emails AS (
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK) ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
)
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM CTE_Patterns AS BL WITH(NOLOCK)
INNER JOIN CTE_Emails AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
When running both CTE queries separately, it's super fast (0 secs in SSMS, returning 122 rows and 13k rows); when running the full query, with the INNER JOIN on sEmail, it's super slow (22 minutes).
Here's the query that performs well, with temp tables (0 sec on our hardware), and which does the exact same thing and returns the same result:
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
INTO #tb1
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email PELE ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
INTO #tb2
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM #tb1 AS BL WITH(NOLOCK)
INNER JOIN #tb2 AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
DROP TABLE #tb1
DROP TABLE #tb2
Table stats:
OtherDb.dbo.Purchased_Email_List: 13 rows, 2 rows flagged bPattern = 1
OtherDb.dbo.Purchased_Email_List_Email: 324,289 rows, 122 rows with patterns (which are used in this issue)
dbo.NewsletterService_import_list_email: 15.5M rows
dbo.NewsletterService_import_list_email_distinct: ~1.5M rows
WHERE ILE.iId_newsletterservice_import_list = 1000 retrieves ~13k rows
I can post more info about tables on request.
Can someone help me understand this?
UPDATE
Here is the query plan for the CTE query:
Here is the query plan with temp tables:
As you can see in the query plan, with CTEs the engine reserves the right to apply them basically as a lookup, even when you want a join.
If it isn't sure enough that it can run the whole thing independently, in advance, essentially generating a temp table, it falls back to "let's just run it once for each row".
This is perfect for the recursive queries they can do like magic.
But you're seeing, in the nested Nested Loops, where it can go terribly wrong.
You're already finding the answer on your own by trying the real temp table.
Parallelism. If you look at your TEMP TABLE version, the 3rd query indicates parallelism in both distributing and gathering the work of the 1st query, and parallelism when combining the results of the 1st and 2nd queries. The 1st query also, incidentally, has a relative cost of 77%. So the query engine in your TEMP TABLE example was able to determine that the 1st query can benefit from parallelism, especially when the parallelism is Gather Streams and Distribute Streams, since that allows the work (the join) to be divvied up because the data is distributed in a way that lets it be split and then recombined. Notice the cost of the 2nd query is 0%, so you can ignore it as having no cost other than when its results need to be combined.
Looking at the CTE, that is entirely processed serially and not in parallel. So somehow with the CTE it could not figure out that the 1st query can be run in parallel, nor the relationship between the 1st and 2nd queries. It's possible that with multiple CTE expressions it assumes some dependency and does not look ahead far enough.
Another test you can do with the CTE is to keep CTE_Patterns but eliminate CTE_Emails by putting it as a derived-table subquery in the 3rd query, as sketched below. It would be interesting to see the execution plan, and whether there is parallelism when it is expressed that way.
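For reference, that variant (built from the queries in the question) might look like the following; comparing its plan with the other two would show whether the optimizer parallelizes it:

WITH CTE_Patterns AS (
    SELECT PEL.iId_purchased_email_list, PELE.sEmail
    FROM OtherDb.dbo.Purchased_Email_List PEL
    INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE
        ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
    WHERE PEL.bPattern = 1
)
SELECT I.iId_newsletterservice_import_list,
       I.iId_newsletterservice_import_list_email,
       BL.iId_purchased_email_list
FROM CTE_Patterns AS BL
INNER JOIN (
    -- former CTE_Emails, now inlined as a derived table
    SELECT ILE.iId_newsletterservice_import_list,
           ILE.iId_newsletterservice_import_list_email,
           ILED.sEmail
    FROM dbo.NewsletterService_import_list_email AS ILE
    INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED
        ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
    WHERE ILE.iId_newsletterservice_import_list = 1000
) AS I
    ON I.sEmail LIKE BL.sEmail;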
In my experience it's best to use CTEs for recursion and temp tables when you need to join back to the data. That typically makes for a much faster query.
I have written a table-valued UDF that starts by a CTE to return a subset of the rows from a large table.
There are several joins in the CTE. A couple of inner and one left join to other tables, which don't contain a lot of rows.
The CTE has a where clause that returns the rows within a date range, in order to return only the rows needed.
I'm then referencing this CTE in 4 self left joins, in order to build subtotals using different criteria.
The query is quite complex, but here is a simplified pseudo-version of it:
WITH DataCTE as
(
SELECT [columns] FROM table
INNER JOIN table2
ON [...]
INNER JOIN table3
ON [...]
LEFT JOIN table3
ON [...]
)
SELECT [aggregates_columns of each subset] FROM DataCTE Main
LEFT JOIN DataCTE BananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality = 100
LEFT JOIN DataCTE DamagedBananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality < 20
LEFT JOIN DataCTE MangosSubset
ON [...]
GROUP BY [
I have the feeling that SQL Server gets confused and calls the CTE for each self join, which seems to be confirmed by the execution plan, although I confess I am not an expert at reading those.
I would have assumed SQL Server to be smart enough to perform the data retrieval from the CTE only once, rather than do it several times.
I have tried the same approach but rather than using a CTE to get the subset of the data, I used the same select query as in the CTE, but made it output to a temp table instead.
The version referring the CTE version takes 40 seconds. The version referring the temp table takes between 1 and 2 seconds.
Why isn't SQL Server smart enough to keep the CTE results in memory?
I like CTEs, especially in this case as my UDF is a table-valued one, so it allowed me to keep everything in a single statement.
To use a temp table, I would need to write a multi-statement table valued UDF, which I find a slightly less elegant solution.
Have any of you had this kind of performance issue with CTEs, and if so, how did you get it sorted?
Thanks,
Kharlos
I believe that CTE results are retrieved every time. With a temp table the results are stored until it is dropped. This would seem to explain the performance gains you saw when you switched to a temp table.
Another benefit is that you can create indexes on a temporary table, which you can't do with a CTE. Not sure if there would be a benefit in your situation, but it's good to know.
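A small sketch of what that could look like (table and column names are purely illustrative, and this assumes a context such as a stored procedure where temp tables are allowed):

-- illustrative only: materialize the subset once, then index it for the self joins
SELECT OrderID, Product, Quality, Amount      -- stand-ins for [columns]
INTO #Data
FROM dbo.SomeLargeTable                       -- stand-in for the CTE's base query
WHERE OrderDate >= '20240101' AND OrderDate < '20240201';

CREATE NONCLUSTERED INDEX IX_Data_Product_Quality
    ON #Data (Product, Quality) INCLUDE (Amount);

-- the four subtotal left joins then read the indexed temp table
-- instead of re-running the underlying query each time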
Related reading:
Which are more performant, CTE or temporary tables?
SQL 2005 CTE vs TEMP table Performance when used in joins of other tables
http://msdn.microsoft.com/en-us/magazine/cc163346.aspx#S3
Quote from the last link:
The CTE's underlying query will be called each time it is referenced in the immediately following query.
I'd say go with the temp table. Unfortunately elegant isn't always the best solution.
UPDATE:
Hmmm, that makes things more difficult. It's hard for me to say without looking at your whole environment.
Some thoughts:
Can you use a stored procedure instead of a UDF (instead of it, not from within it)?
This may not be possible, but if you can remove the left join from your CTE, you could move it into an indexed view. If you are able to do this, you may see performance gains over even the temp table; a rough sketch follows.
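A minimal sketch of what such an indexed view could look like (all names here are illustrative; indexed views require SCHEMABINDING, two-part table names, and no outer joins, which is why the left join would have to go):

-- illustrative only: the real tables/columns would come from your schema
CREATE VIEW dbo.vw_DataSubset
WITH SCHEMABINDING
AS
SELECT d.OrderID, d.Product, d.Quality, d.Amount, o.CustomerID
FROM dbo.OrderDetails AS d
INNER JOIN dbo.Orders AS o ON o.OrderID = d.OrderID;
GO

-- the unique clustered index is what materializes (and maintains) the view
CREATE UNIQUE CLUSTERED INDEX IX_vw_DataSubset
    ON dbo.vw_DataSubset (OrderID, Product);
GO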