Performance Issues with COUNT(*) in SQL Server

I am having some performance issues with a query I am running in SQL Server 2008. I have the following query:
Query1:
SELECT GroupID, COUNT(*) AS TotalRows
FROM Table1
INNER JOIN (
    SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2
    ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
GROUP BY GroupID
Table1 contains about 500,000 rows. Table2 contains about 50,000, but will eventually contain millions. Playing around with it, I found that rewriting the query as follows reduces the execution time to under 1 second.
Query 2:
SELECT GroupID
FROM Table1
INNER JOIN (
    SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2
    ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
What I do not understand is why this is slow, since it is a simple count query. If I execute the following query on Table1, it returns in under 1 second:
Query 3:
SELECT Count(*) FROM Table1
This query returns around 500,000 as the result.
However, the original query (Query 1) above only returns a count of around 50,000 yet takes about 3 seconds to execute, even though simply removing the COUNT and GROUP BY (Query 2) reduces the execution time to under 1 second.
I do not believe this is an indexing issue, as I already have indexes on the appropriate columns. Any help would be much appreciated.

Performing a simple COUNT(*) FROM table can do a much more efficient scan of the clustered index, since it doesn't have to care about any filtering, joining, grouping, etc. The queries that include full-text search predicates and mysterious subqueries have to do a lot more work. The count is not the most expensive part there - I bet they're still relatively slow if you leave the count out but leave the group by in, e.g.:
SELECT GroupID
FROM Table1
INNER JOIN (
    SELECT Column1 FROM Table2 WHERE GroupID = @GroupID
) AS Table2
    ON Table2.Column1 = Table1.Column1
WHERE CONTAINS(Table1.*, @Word)
GROUP BY GroupID;
Looking at the actual execution plan you provided in the free SQL Sentry Plan Explorer* (plan screenshots not reproduced here), I believe you should:
Update the statistics on both Inventory and A001_Store_Inventory so that the optimizer can get a better rowcount estimate (which could lead to a better plan shape).
Ensure that Inventory.ItemNumber and A001_Store_Inventory.ItemNumber are the same data type to avoid an implicit conversion.
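In case a concrete starting point helps, here is a rough sketch of those two suggestions. The dbo schema is an assumption, and the INT type below is only a placeholder for whatever type Inventory.ItemNumber actually is:
UPDATE STATISTICS dbo.Inventory WITH FULLSCAN;
UPDATE STATISTICS dbo.A001_Store_Inventory WITH FULLSCAN;

-- Align the data types only if they really differ; pick the type that
-- matches Inventory.ItemNumber (INT here is just an assumption).
ALTER TABLE dbo.A001_Store_Inventory ALTER COLUMN ItemNumber INT NOT NULL;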
(*) disclaimer: I work for SQL Sentry.

You should have a look at the query plan to see what SQL Server is doing to retrieve the data you requested. Also, I think it would be better to rewrite your original query as follows:
SELECT
    Table1.GroupID -- When you use JOINs, it's always better to specify table (or alias) names
    ,COUNT(Table1.GroupID) AS TotalRows
FROM
    Table1
    INNER JOIN Table2 ON
        (Table2.Column1 = Table1.Column1) AND
        (Table2.GroupID = @GroupID)
WHERE
    CONTAINS(Table1.*, @Word)
GROUP BY
    Table1.GroupID
Also, keep in mind that a simple COUNT and a COUNT with a JOIN and GROUP BY are not the same thing. In one case it is just a matter of scanning an index and counting; in the other, additional tables and grouping are involved, which can be time-consuming depending on several factors.
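If you want to see where the time goes, as suggested at the top of this answer, you can ask SQL Server for the plan directly. A minimal sketch (in SSMS you can instead just press Ctrl+M to include the actual execution plan):
SET SHOWPLAN_XML ON;
GO
-- Put the slow query here; a simplified stand-in is shown.
-- The statement is compiled but not executed; the estimated plan is returned as XML.
SELECT GroupID, COUNT(*) AS TotalRows
FROM Table1
GROUP BY GroupID;
GO
SET SHOWPLAN_XML OFF;
GO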

Related

Why does SQL Server do a scan on joins when there are no records in the source table

The idea of the below query is to use the CTE to get the primary key of all rows in [Archive].[tia_tia_object] that meet the filter.
The execution time for the query within the CTE is 0 seconds.
The second part is supposed to do joins on other tables, to filter the data some more, but only if there are any rows returned in the CTE. This was the only way I could get SQL Server to use the correct indexes.
Why does it spend time (see execution plan) looking in TIA_TIA_AGREEMENT_LINE and TIA_TIA_OBJECT when the CTE returns 0 rows?
WITH cte_vehicle
AS (SELECT O.[Seq_no],
O.Object_No
FROM [Archive].[tia_tia_object] O
WHERE O.RECORD_TIMESTAMP >
(SELECT LastLoadTimeStamp FROM staging.Ufngetlastloadtimestamp('Staging.CoveredObject'))
AND O.[Meta_iscurrent] = 1
AND O.OBJECT_TYPE IN ( 'BIO01', 'CAO01', 'DKV', 'GFO01',
'KMA', 'KNO01', 'MCO01', 'VEO01',
'SVO01', 'AUO01' ))
SELECT O.[Seq_no] AS [Bkey_CoveredObject],
Cast(O.[Agr_Line_No] AS BIGINT) AS [Agr_Line_No],
O.[Cover_Start_Date] AS [CoverageFrom],
O.[Cover_End_Date] AS [CoverageTo],
O.[Timestamp] AS [TIMESTAMP],
O.[Record_Timestamp] AS [RECORD_TIMESTAMP],
O.[Newest] AS [Newest],
O.LOCATION_ID AS LocationNo,
O.[Cust_no],
O.[N01]
FROM cte_vehicle AS T
INNER JOIN [Archive].[tia_tia_object] O
ON t.Object_No = O.Object_No
AND t.Seq_No = O.Seq_No
INNER JOIN [Archive].[tia_tia_agreement_line] AL
ON O.Agr_line_no = AL.Agr_line_no
INNER JOIN [Archive].[tia_tia_policy] P
ON AL.Policy_no = P.Policy_no
WHERE P.[Transaction_type] <> 'D'
Execution plan:
Because it still needs to check and look for records. Even if there are no records in that table, it doesn't know that until it actually checks.
Much like if someone gives you a sealed box, you don't know whether it's empty until you open it.
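If you want to avoid paying for the joins when there is nothing to process, one option (my suggestion, not part of the original answer) is to materialize the CTE into a temp table and only run the expensive part when it actually returned rows; a sketch with a shortened select list:
SELECT O.[Seq_no], O.Object_No
INTO #cte_vehicle
FROM [Archive].[tia_tia_object] O
WHERE O.RECORD_TIMESTAMP >
      (SELECT LastLoadTimeStamp FROM staging.Ufngetlastloadtimestamp('Staging.CoveredObject'))
  AND O.[Meta_iscurrent] = 1
  AND O.OBJECT_TYPE IN ( 'BIO01', 'CAO01', 'DKV', 'GFO01', 'KMA',
                         'KNO01', 'MCO01', 'VEO01', 'SVO01', 'AUO01' );

IF EXISTS (SELECT 1 FROM #cte_vehicle)
BEGIN
    SELECT O.[Seq_no] AS [Bkey_CoveredObject],
           O.[Cust_no]          -- remaining columns exactly as in the original query
    FROM #cte_vehicle AS T
    INNER JOIN [Archive].[tia_tia_object] O
        ON T.Object_No = O.Object_No
       AND T.Seq_No = O.Seq_No
    INNER JOIN [Archive].[tia_tia_agreement_line] AL
        ON O.Agr_line_no = AL.Agr_line_no
    INNER JOIN [Archive].[tia_tia_policy] P
        ON AL.Policy_no = P.Policy_no
    WHERE P.[Transaction_type] <> 'D';
END;

DROP TABLE #cte_vehicle;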

Too many parameter values slowing down query

I have a query that runs fairly fast under normal circumstances. But it is running very slow (at least 20 minutes in SSMS) due to how many values are in the filter.
Here's the generic version of it, and you can see that one part is filtering by over 8,000 values, making it run slow.
SELECT DISTINCT
column
FROM
table_a a
JOIN
table_b b ON (a.KEY = b.KEY)
WHERE
a.date BETWEEN @Start AND @End
AND b.ID IN (... over 8,000 values)
AND b.place IN ( ... 20 values)
ORDER BY
a.column ASC
It's to the point where it's too slow to use in the production application.
Does anyone know how to fix this, or optimize the query?
To make a query fast, you need indexes.
You need a separate index for the following columns: a.KEY, b.KEY, a.date, b.ID, b.place.
As gotqn wrote before, putting your 8,000 items into a temp table and inner joining it will make the query even faster, but without the indexes on the other side of the join it will still be slow. A sketch of the indexes follows.
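A rough sketch of the separate indexes described above (the index names are my own; the bracketed column names come from the generic query):
CREATE INDEX IX_table_a_key   ON table_a ([KEY]);
CREATE INDEX IX_table_a_date  ON table_a ([date]);
CREATE INDEX IX_table_b_key   ON table_b ([KEY]);
CREATE INDEX IX_table_b_id    ON table_b ([ID]);
CREATE INDEX IX_table_b_place ON table_b ([place]);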
What you need is to put the filtering values in a temporary table, then use that table to apply the filtering with an INNER JOIN instead of WHERE IN. For example:
IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
BEGIN
    DROP TABLE #FilterDataSource;
END;

CREATE TABLE #FilterDataSource
(
    [ID] INT PRIMARY KEY
);

INSERT INTO #FilterDataSource ([ID])
SELECT ... ; -- you need to split the 8,000 values and insert them here

SELECT DISTINCT
    a.column
FROM table_a a
INNER JOIN table_b b
    ON (a.KEY = b.KEY)
INNER JOIN #FilterDataSource FS
    ON b.ID = FS.ID
WHERE a.date BETWEEN @Start AND @End
  AND b.place IN ( ... 20 values)
ORDER BY a.column ASC;
A few important notes:
we are using a temporary table in order to allow parallel execution plans to be used
if you have a fast splitting function (for example a CLR function), you can join the function itself - see the sketch below
it is not good to use IN with many values; SQL Server is not always able to build a good execution plan for it, which may lead to timeouts/internal errors - you can find more information here
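For completeness, a sketch of the "join the splitting function" idea. It assumes SQL Server 2016+ where the built-in STRING_SPLIT exists (this answer predates it and suggests a CLR splitter instead), and @IdList is a hypothetical parameter holding the IDs as CSV:
DECLARE @IdList NVARCHAR(MAX) = N'101,102,103'; -- in reality, the 8,000+ IDs

INSERT INTO #FilterDataSource ([ID])
SELECT DISTINCT CAST(value AS INT)
FROM STRING_SPLIT(@IdList, ',');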

Update with limit and offset applied to joined table

I have an UPDATE with an INNER JOIN. My overall question is how (if it is possible at all) to apply LIMIT and OFFSET to that joined table.
Example query without limit and offset:
UPDATE t2
SET t2.some_col = t1.some_col
FROM table_1 t1
INNER JOIN table_2 t2
ON t1.other_col = t2.other_col
And how can I rebuild this query to process only the first 1,000,000 records from t2, then records 1,000,000 - 2,000,000, then 2,000,000 - 3,000,000, and so on?
Exact scenario:
My task is to rebuild very large tables, replacing hash indexes (char(32)) with bigint indexes. Example tables:
URLS:
id char(32)
other_columns
intUrlId bigint (added and filled)

PAGE_VIEWS:
urlId char(32)
referrerUrlId char(32)
other_columns
intUrlId bigint (needs to update)
intReferrerUrlId bigint (needs to update)
The first table has about 200 million records, the second over 1 billion. I update these tables in batches. The update job wouldn't be difficult if I could use WHERE urls.intUrlId BETWEEN ..., but I can't. Sometimes the JOIN returns, for example, 500,000 records for a single batch, but many times it returns 0, so it updates 0 records, yet the join on such big tables still costs quite a lot of time. So I need equally sized batches limited by the page_views table, not the urls table. The page_views table has no column I can base a WHERE clause on, so I need to limit it with TOP and ROW_NUMBER(), but I don't know how. (I'm quite new to MS SQL; I used to work with MySQL and PostgreSQL databases, which have LIMIT and OFFSET clauses.)
With any answer I would appreciate information about the cost of the solution, because some people would be satisfied with any LIMIT/OFFSET solution, but not me. I already have a query that updates what I need, but it uses intUrlId from the urls table and it is slow. I need a faster solution. Server version is 2008.
BTW, don't ask me who the hell based the database on char indexes :-) Now it has become a problem and a multi-TB database needs to be rebuilt.
You can try using a CTE with a ROW_NUMBER():
WITH toUpdate AS
(
    SELECT id, intUrlId, ROW_NUMBER() OVER (ORDER BY something) AS RowNumber
    FROM [XXX].[ZZZ].[Urls]
)
UPDATE pv
SET pv.intUrlId = urls.intUrlId
FROM toUpdate urls
INNER JOIN [XXX].[YYY].[PageViews] pv
    ON pv.urlId = urls.id
    AND RowNumber BETWEEN 10000 AND 20000
To answer the question "how to set LIMIT and OFFSET on the joined table": in Jeremy's answer the tables need to be switched. I'll give the correct answer for the example query I used in my question.
WITH toUpdate AS
(
SELECT some_col, other_col, ROW_NUMBER() OVER (ORDER BY any_column) AS RowNumber
FROM table_2
)
UPDATE toUpdate
SET toUpdate.some_col = t1.some_col
FROM table_1 t1
INNER JOIN toUpdate ON t1.other_col = toUpdate.other_col
AND RowNumber BETWEEN 1000000 AND 2000000
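A sketch of one way (my own addition, not from the thread) to walk the whole table_2 in batches of 1,000,000 rows using the CTE above; the ORDER BY column must give a stable ordering across iterations so the batches do not overlap:
DECLARE @BatchSize BIGINT = 1000000;
DECLARE @From BIGINT = 1;
DECLARE @MaxRow BIGINT = (SELECT COUNT_BIG(*) FROM table_2);

WHILE @From <= @MaxRow
BEGIN
    ;WITH toUpdate AS
    (
        SELECT some_col, other_col,
               ROW_NUMBER() OVER (ORDER BY any_column) AS RowNumber
        FROM table_2
    )
    UPDATE toUpdate
    SET toUpdate.some_col = t1.some_col
    FROM table_1 t1
    INNER JOIN toUpdate
        ON t1.other_col = toUpdate.other_col
       AND RowNumber BETWEEN @From AND @From + @BatchSize - 1;

    SET @From = @From + @BatchSize;
END;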

How does a T-SQL update work without a join

I think my head is muddy or something. I'm trying to figure out how a t-sql update works without a join when updating one table from another. I've always used joins in the past but came across a stored proc where someone else created one without a join. This update is being used in SQL 2008R2 and it works.
Update table1
SET col1 = (SELECT TOP 1 colX FROM table2 WHERE colZ = colY),
col2 = (SELECT TOP 1 colE FROM table2 WHERE colZ = colY)
Obviously, colY is a field in table1. To get the same results in a select statement (not update), a join is required. I guess I don't understand how an update works behind the scenes but it must be doing some kind of join?
SQL Server translates those subqueries into joins. You can see this by looking at the query plan: write an equivalent query with UPDATE ... FROM ... JOIN syntax and you will observe that the plan is essentially the same.
The sample code shown is unusual, hard to understand, redundant and inflexible. I recommend against using this style.
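For reference, a sketch of that equivalent UPDATE ... FROM ... JOIN form (note that, without the TOP 1, duplicates in table2.colZ make the result non-deterministic, much as they do with the original subqueries):
UPDATE t1
SET t1.col1 = t2.colX,
    t1.col2 = t2.colE
FROM table1 AS t1
INNER JOIN table2 AS t2
    ON t2.colZ = t1.colY;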
No, it's doing a subquery - well, two in this case. It would be damn painful if you had another 98 columns.
You can do something similar for select
select *,
(SELECT TOP 1 colX FROM table2 WHERE colZ = colY) as col1
From table1
A left join would simply be more efficient. Your example, unless the DBMS optimises it, is running the subquery (or subqueries) for each row in the table.
Got to say whoever wrote it is less than competent.
These subqueries are what are called correlated subqueries. If you were to write the same query as a SELECT rather than an UPDATE, it would look like this.
SELECT col1 = (SELECT TOP 1 table2.colX FROM table2 WHERE table2.colZ = table1.colY),
col2 = (SELECT TOP 1 table2.colE FROM table2 WHERE table2.colZ = table1.colY)
FROM table1
The JOIN is in the fact that you are referencing a column from an outside table on the inside of the subquery. Table1 is referenced in the UPDATE command. You can include a FROM clause but it isn't required for a setup like this.
You can use the same syntax in a SELECT with no join, but you need to alias the table if colY also exists in table2
SELECT (SELECT TOP 1 colX FROM table2 WHERE colZ = T.colY)
, (SELECT TOP 1 colE FROM table2 WHERE colZ = T.colY)
FROM table1 AS T
I only ever use this sort of thing when building up an ad hoc query just for my own information. If it's going to be put into any sort of permanent code I'll convert it to a join, as it's easier to read and more maintainable.

Most efficient query to get records from a table with IDs

I have a table whose purpose is to hold IDs.
I want to select many records from another table (a big table of millions of records).
Which one would perform better:
SELECT id, att1, att2
FROM myTable
WHERE id IN (SELECT id FROM #myTabwithIDS)
Or
SELECT id, att1, att2
FROM myTable t
INNER JOIN #myTabwithIDS t2
ON t2.id = t.id
I would use the Query Analyzer built in to SQL Server to explore the execution plan.
http://www.sql-server-performance.com/2006/query-analyzer/
Specifically turn on Show Execution Plan, and Statistics IO and Time.
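For example, a small sketch of that comparison: run both candidates in one window with the execution plan and statistics options on (or the equivalent SET commands below) and compare the logical reads and CPU time reported for each.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT id, att1, att2
FROM myTable
WHERE id IN (SELECT id FROM #myTabwithIDS);

SELECT id, att1, att2
FROM myTable t
INNER JOIN #myTabwithIDS t2
    ON t2.id = t.id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;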
Normally a join is better than a subquery, especially in your case where the outer query's condition depends on the results of the subquery. See Subqueries vs joins for more details.
