I have two tables: Client and Transaction (they can be seen as an example in this db-fiddle). A Client may have thousands of transactions.
I'm creating a query to get a list of clients and their last transaction, to know which ones are inactive (e.g. which have not made transactions in the last 30/90/180 days). For that I'm using this query:
SELECT C.*, T.[CreationDate] AS LastTransactionDate FROM Client AS C
OUTER APPLY (
SELECT TOP 1 T.CreationDate
FROM [Transaction] AS T
WHERE T.ClientId = C.ClientId
ORDER BY T.CreationDate DESC
) AS T;
And it works very well, but as the data grows so does the query delay. I've tested it on a table with approximately 50 million transactions and it took about 1 minute. What strategy can I adopt here to improve this performance?
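One strategy that usually helps this exact pattern (not from the original post; the index name here is made up) is a covering index on the Transaction table keyed by client and ordered by date, so that each TOP 1 ... ORDER BY CreationDate DESC resolves to a single index seek per client instead of scanning that client's transactions:
CREATE NONCLUSTERED INDEX IX_Transaction_ClientId_CreationDate
    ON [Transaction] (ClientId, CreationDate DESC);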
I have the following query, which returns 1550 rows.
SELECT *
FROM V_InventoryMovements -- 2 seconds
ORDER BY V_InventoryMovements.TransDate -- 23 seconds
It takes about 2 seconds to return the results.
But when I include the ORDER BY clause, then it takes about 23 seconds.
It is a BIG change just for adding an ORDER BY.
I would like to know what is happening, and a way to improve the query with the ORDER BY. Removing the ORDER BY is not an acceptable solution.
Here is a bit of information; please let me know if you need more.
V_InventoryMovements
CREATE VIEW [dbo].[V_InventoryMovements]
AS
SELECT some_fields
FROM FinTime
RIGHT OUTER JOIN V_Outbound ON FinTime.StdDate = dbo.TruncateDate(V_Outbound.TransDate)
LEFT OUTER JOIN ReasonCode_Grouping ON dbo.V_Outbound.ReasonCode = dbo.ReasonCode_Grouping.ReasonCode
LEFT OUTER JOIN Items ON V_Outbound.ITEM = Items.Item
LEFT OUTER JOIN FinTime AS FinTime2 ON V_Outbound.EventDay = FinTime2.StdDate -- second reference to FinTime needs its own alias
V_Outbound
CREATE VIEW [dbo].[V_Outbound]
AS
SELECT V_Outbound_WMS.*
FROM V_Outbound_WMS
UNION
SELECT V_Transactions_Calc.*
FROM V_Transactions_Calc
V_OutBound_WMS
CREATE VIEW [dbo].[V_OutBound_WMS]
AS
SELECT some_fields
FROM Transaction_Log
INNER JOIN MFL_StartDate ON Transaction_Log.TransDate >= MFL_StartDate.StartDate
LEFT OUTER JOIN Rack ON Transaction_Log.CHARGE = Rack.CHARGE AND Transaction_Log.CHARGE_LFD = Rack.CHARGE_LFD
V_Transactions_Calc
CREATE VIEW [dbo].[V_Transactions_Calc]
AS
SELECT some_fields
FROM Transactions_Calc
INNER JOIN MFL_StartDate ON dbo.Transactions_Calc.EventDay >= dbo.MFL_StartDate.StartDate
And here I will also share the part of the execution plan where you can see the main cost. I don't know exactly how to read it or how to improve the query from it. Let me know if you need to see the rest of the execution plan, but all the other parts are 0% cost. The main cost is in the Nested Loops (Left Outer Join), at 95%.
Execution Plan With ORDER BY
Execution Plan Without ORDER BY
I think the short answer is that the optimizer is executing in a different order in an attempt to minimize the cost of the sorting, and doing a poor job. Its job is made very hard by the views within views within views, as GuidoG suggests. You might be able to convince it to execute differently by creating some additional index or statistics, but it's going to be hard to advise on that remotely.
A possible workaround might be to select into a temp table, then apply the ordering afterwards:
SELECT *
INTO #temp
FROM V_InventoryMovements;
SELECT *
FROM #temp
ORDER BY TransDate
I have a question regarding table design / query efficiency in SQL.
I have two tables: Table A contains a list of clients, Table B contains client IDs and the last time a message was received from each client.
The number of clients is growing and is in the tens of thousands; each client sends a message roughly once a minute, sometimes more, sometimes less, but on average it is about that.
Table B is growing rather fast.
The question is this: I want to be able to pull a list of all clients and their last seen date and time.
The problem is that as the table grows, the query execution time gets longer, and the query requires a scan of all the rows in Table A and Table B.
I have introduced a new column in Table B, which is just a date-type column, and created a non-clustered, non-unique index on it; however, it does not seem to make much difference.
The query is:
SELECT [TableA].[Client_ID], ISNULL(R.Most_Recent_TimeStamp, '2000-01-01') AS Most_Recent_Comms
FROM [TableA]
LEFT JOIN (SELECT [TableB].[Client_ID], MAX([TableB].[Time_Stamp]) AS Most_Recent_TimeStamp
           FROM [TableB] WITH(NOLOCK)
           GROUP BY [TableB].[Client_ID]) AS R
    ON [TableA].[Client_ID] = R.Client_ID
The execution time is in the tens of seconds. Things improved a fair amount when I included the WITH(NOLOCK) hint. And you can imagine that as time progresses and TableB grows, the execution time will keep growing.
I do not think this is the right way to go.
I am sure there is a better way. What about creating a view or another table and writing a trigger which updates the new table every time a row is inserted into TableB? The new table would always be kept up to date, and one could run a simple SELECT query against it.
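For illustration only, here is a minimal sketch of that trigger idea. The table name TableB_LastSeen and the trigger name are made up; the column names follow the query above:
CREATE TABLE TableB_LastSeen (
    Client_ID  INT PRIMARY KEY,
    Time_Stamp DATETIME NOT NULL
);
GO
CREATE TRIGGER trg_TableB_LastSeen ON TableB
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Upsert the newest timestamp per client from the inserted rows
    MERGE TableB_LastSeen AS target
    USING (SELECT Client_ID, MAX(Time_Stamp) AS Time_Stamp
           FROM inserted
           GROUP BY Client_ID) AS src
        ON target.Client_ID = src.Client_ID
    WHEN MATCHED AND src.Time_Stamp > target.Time_Stamp THEN
        UPDATE SET Time_Stamp = src.Time_Stamp
    WHEN NOT MATCHED THEN
        INSERT (Client_ID, Time_Stamp) VALUES (src.Client_ID, src.Time_Stamp);
END;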
I would suggest one of the following:
SELECT b.ClientId, MAX(b.TimeStamp)
FROM TableB b
GROUP BY b.ClientId;
This assumes that all clients are in TableB. If not:
SELECT a.ClientId, b.TimeStamp
FROM TableA a OUTER APPLY
     (SELECT TOP (1) b.*
      FROM TableB b
      WHERE b.ClientId = a.ClientId
      ORDER BY b.TimeStamp DESC
     ) b;
For both queries, you want an index on TableB(ClientId, TimeStamp).
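A sketch of that index, using the column names from the question (adjust to your actual schema):
CREATE NONCLUSTERED INDEX IX_TableB_Client_TimeStamp
    ON TableB (Client_ID, Time_Stamp DESC);
With this in place, the GROUP BY/MAX query can stream the index, and the OUTER APPLY version becomes one seek per client.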
I have 2 tables, TableA and TableB
TableA contains nearly 3,000,000 records
TableB contains about 10,000 records
I want to delete the entries in TableA that match certain parameters. This query has worked OK for smaller tables, but I get timeout exceptions when running it from VB.Net:
delete FROM TableA WHERE (((TableA.ID) In (SELECT [TableB].ID FROM TableB)) AND ((TableA.EVDATE)='20170720'));
In an effort to see what's going on, I changed this to a SELECT * FROM... in SSMS and at 5 minutes with no result, I stopped it...
Why does this stall and is there a better way of doing this?
I think this is much easier to read:
DELETE FROM TableA
WHERE TableA.EVDATE = '20170720'
  AND TableA.ID IN (SELECT [TableB].ID FROM TableB);
You will need to split the delete into multiple calls, each removing a batch of records.
Example: delete 1000 rows at a time:
DELETE TOP (1000) FROM TableA
WHERE TableA.EVDATE = '20170720'
  AND TableA.ID IN (SELECT [TableB].ID FROM TableB);
You can run this in a loop and check @@ROWCOUNT: if it is 0 (or less than 1000), all matching rows have been deleted and there is no need to run it again (see the sketch below).
The big delete needs a lot of transaction log resources (the DatabaseName_Log.ldf file, not the .mdf data file), and when you split it into batches you use fewer of those resources and write less log at a time, which lets each smaller batch of deletes execute faster.
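A minimal sketch of that loop (the batch size of 1000 is arbitrary; adjust as needed):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000) FROM TableA
    WHERE TableA.EVDATE = '20170720'
      AND TableA.ID IN (SELECT [TableB].ID FROM TableB);

    SET @rows = @@ROWCOUNT; -- 0 means no matching rows remain
END;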
I am experiencing a strange performance issue. I have a view based on a CTE. It's a view that I wrote years ago, and it has been running without issue. Suddenly, 4 days ago, the query that used to run in 1-2 minutes ran for hours before we identified the long-running query and halted it.
The CTE produces a time-stamped list of transactions that an agent performs. I then select from the CTE, left joining back to the CTE using the timestamp of the subsequent transaction to determine the length of time an agent spends on each transaction.
WITH [CTE_TABLE] (COLUMNS) AS
(
SELECT [INDEXED COLUMNS]
,[WINDOWED FUNCTION] AS ROWNUM
FROM [DB_TABLE]
WHERE [EMPLOYEE_ID] = 111213
)
SELECT [T1].[EMPLOYEE_ID]
,[T1].[TRANSACTION_NAME]
,[T1].[TIMESTAMP] AS [START_TIME]
,[T2].[TIMESTAMP] AS [END_TIME]
FROM [CTE_TABLE] [T1]
LEFT OUTER JOIN [CTE_TABLE] [T2] ON
(
[T1].[EMPLOYEE_ID] = [T2].[EMPLOYEE_ID]
AND [T1].[ROWNUM] = [T2].[ROWNUM] + 1
)
In testing I filter for a specific agent. If I run the inner portion of the CTE, it produces 500 records in less than a second. (When not filtering for a single agent, it produces 95K records in 14 seconds; this is the normal running timeframe.) If I run the CTE with a simple SELECT * FROM [CTE_TABLE], it also runs in less than a second. When I run it using an INNER JOIN back to itself, again it runs in less than a second. Finally, when I run it as a LEFT OUTER JOIN, it takes over a minute and a half just for the 500 records of a single agent. I need the LEFT OUTER JOIN because the final record of the day is the agent's log-off from the system, and it never has a record to join to.
The table that I pull from is over 22GB in size and has 500 million rows. Selecting the records from this table for a single day takes 14 seconds, or for a single agent less than a second, so I don't think the performance bottleneck comes from the source table. The bottleneck occurs in the LEFT OUTER JOIN back to the CTE, but I have always had the LEFT OUTER JOIN. Again, the very strange aspect is that this only began 4 days ago. I have checked space on the server; there is plenty. The CPU spikes to approximately 25% and remains there until the query ends, either on its own or when halted by an admin.
I am hoping someone has some ideas as to what could have caused this. It appears to have cropped up from nowhere.
Again, the very strange aspect is that this only began 4 days ago
I recommend updating statistics on the tables involved and also trying to rebuild the indexes.
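As a sketch of both maintenance steps, assuming the underlying table is the [DB_TABLE] from the pseudo-code above:
UPDATE STATISTICS [DB_TABLE] WITH FULLSCAN;
ALTER INDEX ALL ON [DB_TABLE] REBUILD;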
The bottleneck occurs in the LEFT OUTER JOIN back to the CTE
A CTE will not have any statistics; I would recommend materializing the CTE into a temp table to see if this helps.
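A minimal sketch of that materialization, reusing the placeholder names from the question:
SELECT [INDEXED COLUMNS],
       [WINDOWED FUNCTION] AS ROWNUM
INTO #CTE_TABLE
FROM [DB_TABLE]
WHERE [EMPLOYEE_ID] = 111213;

SELECT [T1].[EMPLOYEE_ID],
       [T1].[TRANSACTION_NAME],
       [T1].[TIMESTAMP] AS [START_TIME],
       [T2].[TIMESTAMP] AS [END_TIME]
FROM #CTE_TABLE AS [T1]
LEFT OUTER JOIN #CTE_TABLE AS [T2]
    ON  [T1].[EMPLOYEE_ID] = [T2].[EMPLOYEE_ID]
    AND [T1].[ROWNUM] = [T2].[ROWNUM] + 1;
An index on #CTE_TABLE (EMPLOYEE_ID, ROWNUM) may also help the self-join.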
We have had an issue since a recent update to our database (I made this update; I am guilty here): one of the queries we use became much slower. I tried to modify the query to get faster results, and managed to achieve my goal with temp tables, which is not bad, but I fail to understand why this solution performs better than a CTE-based one that runs the same queries. Maybe it has to do with the fact that some tables are in a different DB?
Here's the query that performs badly (22 minutes on our hardware):
WITH CTE_Patterns AS (
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK) ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
),
CTE_Emails AS (
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK) ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
)
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM CTE_Patterns AS BL WITH(NOLOCK)
INNER JOIN CTE_Emails AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
When running both CTE queries separately, it's super fast (0 seconds in SSMS, returning 122 rows and 13k rows); when running the full query with the INNER JOIN on sEmail, it's super slow (22 minutes).
Here's the query that performs well with temp tables (0 seconds on our hardware), which does the exact same thing and returns the same result:
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
INTO #tb1
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email PELE ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
INTO #tb2
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM #tb1 AS BL WITH(NOLOCK)
INNER JOIN #tb2 AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
DROP TABLE #tb1
DROP TABLE #tb2
Table stats:
OtherDb.dbo.Purchased_Email_List: 13 rows, 2 rows flagged bPattern = 1
OtherDb.dbo.Purchased_Email_List_Email: 324,289 rows, 122 rows with patterns (which are used in this issue)
dbo.NewsletterService_import_list_email: 15.5M rows
dbo.NewsletterService_import_list_email_distinct: ~1.5M rows
WHERE ILE.iId_newsletterservice_import_list = 1000 retrieves ~13k rows
I can post more info about tables on request.
Can someone help me understand this?
UPDATE
Here is the query plan for the CTE query:
Here is the query plan with temp tables:
As you can see in the query plan, with CTEs, the engine reserves the right to apply them basically as a lookup, even when you want a join.
If it isn't sure, it can run the whole thing independently in advance, essentially generating a temp table, or it can just run it once for each row.
This is perfect for the recursive queries that CTEs can do like magic.
But you're seeing, in the nested Nested Loops, where it can go terribly wrong.
You're already finding the answer on your own by trying the real temp table.
Parallelism. If you look at your TEMP TABLE version, the 3rd query's plan indicates parallelism in both distributing and gathering the work of the 1st query, and parallelism when combining the results of the 1st and 2nd queries. The 1st query also, incidentally, has a relative cost of 77%. So in your TEMP TABLE example the query engine was able to determine that the 1st query can benefit from parallelism, especially when the parallelism is Gather Streams and Distribute Streams: it allows divvying up the work of the join because the data is distributed in a way that lets the work be split and then recombined. Notice the cost of the 2nd query is 0%, so you can ignore it except for when its results need to be combined.
Looking at the CTE version, it is entirely processed serially and not in parallel. So somehow with the CTE the engine could not figure out that the 1st query can be run in parallel, nor the relationship between the 1st and 2nd queries. It's possible that with multiple CTE expressions it assumes some dependency and does not look ahead far enough.
Another test you can do with the CTE is to keep CTE_Patterns but eliminate CTE_Emails by putting it as a derived table (subquery) in the final query. It would be interesting to see the execution plan, and whether there is parallelism, when it is expressed that way (a sketch follows).
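For reference, a sketch of that test, using the same tables and filters as the original query, with CTE_Emails inlined as a derived table:
WITH CTE_Patterns AS (
    SELECT PEL.iId_purchased_email_list,
           PELE.sEmail
    FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
    INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK)
        ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
    WHERE PEL.bPattern = 1
)
SELECT I.iId_newsletterservice_import_list,
       I.iId_newsletterservice_import_list_email,
       BL.iId_purchased_email_list
FROM CTE_Patterns AS BL
INNER JOIN (
    SELECT ILE.iId_newsletterservice_import_list,
           ILE.iId_newsletterservice_import_list_email,
           ILED.sEmail
    FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
    INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK)
        ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
    WHERE ILE.iId_newsletterservice_import_list = 1000
) AS I
    ON I.sEmail LIKE BL.sEmail;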
In my experience it's best to use CTEs for recursion, and temp tables when you need to join back to the data. That typically makes for a much faster query.