SQL Server : How to detect the appropriate timing to update table/index statistics

Is there any way to know the appropriate timing to update table/index statistics?
Recently, performance has been getting worse for one of the major data mart tables in our BI DWH on SQL Server 2012.
All indexes are maintained every weekend with reorganize/rebuild according to their fragmentation percentage, and they are now all under 5% avg_fragmentation_in_percent.
So we suspect the slowdown is caused by obsolete table/index statistics or table fragmentation.
We have auto-update statistics on, and the table/index stats were last updated in July 2018; perhaps the optimizer simply has not decided it is time to update them yet,
since the table is huge: about 0.7 billion rows in total, growing by about 0.5 million rows per day.
Here are the PK statistics and the actual record count of that table.
-- statistics
dbcc show_statistics("DM1","PK_DM1")
Name    Updated           Rows       Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression  Unfiltered Rows
------  ----------------  ---------  ------------  -----  -------  ------------------  ------------  -----------------  ---------------
PK_DM1  07 6 2018 2:54PM  661696443  1137887       101    0        28                  NO            NULL               661696443
-- actual row count
select count(*) row_cnt from DM1;
row_cnt
-------------
706723646
-- Current Index Fragmentations
SELECT a.index_id, name, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (DB_ID(N'DM1'),
OBJECT_ID(N'dbo.DM1'), NULL, NULL, NULL) AS a
JOIN sys.indexes AS b
ON a.object_id = b.object_id AND a.index_id = b.index_id;
GO
index_id name avg_fragmentation_in_percent
--------------------------------------------------
1 PK_DM1 1.32592173128252
7 IDX_DM1_01 1.06209021193359
9 IDX_DM1_02 0.450888386865285
10 IDX_DM1_03 4.78448190118396
So fragmentation is under 5%, but there is a difference of over 45 million rows between the statistics row count and the actual record count.
I'm wondering whether it is worth updating the table/index statistics manually in this case.
If there is any other information you use to decide the appropriate timing to update statistics, any advice would be much appreciated.
Thank you.
-- Result
Thanks to scsimon's advice, I checked all the index statistics in detail and found that the main index was missing RANGE_HI_KEY values -- that index is based on the registration date, and the histogram had no ranges after the last statistics update in July 2018.
(The complaint came from a user who was searching for September 2018 records.)
So I decided to update the table/index statistics, and the same query improved from 1 hour 45 minutes to 3.5 minutes.
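For reference, the checks described above look roughly like this (assuming IDX_DM1_01 is the registration-date index; adjust the names to your schema):
-- Inspect the histogram; if the highest RANGE_HI_KEY is older than the dates
-- being queried, the optimizer will estimate ~1 row for those predicates
DBCC SHOW_STATISTICS ('dbo.DM1', 'IDX_DM1_01') WITH HISTOGRAM;
-- Refresh all statistics on the table (FULLSCAN is optional but more accurate)
UPDATE STATISTICS dbo.DM1 WITH FULLSCAN;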
Deeply appreciate all of the advice on my question.
Best Regards.

Well, you have auto-update statistics on, so that's good. Also, each time an index is rebuilt, its statistics are recomputed. From SQL Server 2008 R2 SP1 through 2014 you can enable trace flag 2371, and from SQL Server 2016 onward its behavior is the default, so that large tables need proportionally fewer changed rows before statistics are automatically updated. Read more here on that.
Also, you are showing stats for a single index, not the whole table, and that index could be filtered. Remember too that Rows Sampled is the total number of rows sampled for the statistics calculation: if Rows Sampled < Rows, the displayed histogram and density results are estimates based on the sampled rows. You can read more on that here.
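To see every statistic on the table together with its sample size and how many rows have changed since the last update, something along these lines should work on SQL Server 2008 R2 SP2 / 2012 SP1 or later (a sketch, not tested against your schema):
-- One row per statistic on dbo.DM1, including the modification counter
SELECT s.name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.DM1');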
Back to the core problem of performance... you are focusing on statistics and the indexes, which isn't a terrible idea, but it's not necessarily the root cause. You need to identify which query is running slowly. Then get help with that slow query by following the steps in that blog, and others. The big one here is to ask a question about that query along with its execution plan. The problem could be the indexes, or it could be:
Memory contention / misallocation
CPU bottleneck
Parallelism (maybe you have MAXDOP set to 0)
Slow disks
Little memory, causing physical reads
The execution plan isn't optimal anymore and perhaps you need to recompile that query
etc, etc, etc... this is where the execution plan and wait stats will shed light (a sample wait-stats query is sketched below)
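As a rough starting point for the wait-stats side, something like this (the exclusion list is illustrative, not exhaustive):
-- Top waits accumulated since the last restart (or since the counters were cleared)
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       signal_wait_time_ms,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'XE_TIMER_EVENT',
                        N'REQUEST_FOR_DEADLOCK_SEARCH', N'LOGMGR_QUEUE',
                        N'CHECKPOINT_QUEUE', N'WAITFOR', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;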

Related

MariaDB Compare 2 tables and delete where not in 1st (Large Dataset)

As part of a uni project, I am using MariaDB 10.5.9 to cleanse some large CSVs with an algorithm, chosen due to the size of the data.
The data has 5 columns: date, time, PlaceID, ID (not unique, repeated), and Location.
It is a large dataset with approximately 50+ million records per day, 386 million records in total over 1 week.
I started by running the algorithm over each day individually and this worked well, the whole process taking between 11 and 15 minutes.
When trying to run it over the 7 days combined, I see a significant impact on performance.
Most elements work, but I have 1 query which compares values in ID with a list of known good IDs and deletes any not in the known-good list:
DELETE quick FROM merged WHERE ID NOT IN (SELECT ID FROM knownID) ;
On a daily table, this query takes around 2 minutes (comparing 50 million rows against 125 million known-good rows; both tables have indexes on their ID columns to speed up the process).
Table size for the merged data is 24.5 GB and for the known-good list is 4.7 GB.
When running across the whole week, I expected it to take around 7 times as long (plus a bit), but the query took just under 2 hours. How can I improve this performance? I am loading both tables into MEMORY tables when performing the work and then copying back to disk-based tables once complete to try to speed up the process; the server has 256 GB RAM, so there is plenty of room. Are there any other settings I can change/tweak?
my.ini is below:
innodb_buffer_pool_size=18G
max_heap_table_size=192G
key_buffer_size=18G
tmp_memory_table_size=64G
Many thanks
innodb_buffer_pool_size=18G -- too low; raise to 200G
max_heap_table_size=192G -- dangerously high; set to 2G
key_buffer_size=18G -- used only by MyISAM; set to 50M
tmp_memory_table_size=64G -- dangerously high; set to 2G
How many rows will be deleted by this?
DELETE quick FROM merged
WHERE ID NOT IN (SELECT ID FROM knownID) ;
Change to the "multi-table" syntax for DELETE and use LEFT JOIN ... IS NULL
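For example, an untested sketch against the tables described above:
-- Multi-table DELETE: remove merged rows whose ID has no match in knownID
DELETE m
FROM merged AS m
LEFT JOIN knownID AS k ON k.ID = m.ID
WHERE k.ID IS NULL;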
If you are deleting more than, say, a thousand rows, do it in chunks. See http://mysql.rjweb.org/doc.php/deletebig
As discussed in that link, it may be faster to build a new table with just the rows you want to keep.
DELETE must keep the old rows until the end of the statement; then (in the background) do the actual delete. This is a lot of overhead.
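A rough sketch of that copy-and-swap approach (it assumes knownID.ID is unique and that merged can be swapped out briefly; table and engine details may need adjusting for your MEMORY-table workflow):
-- Build a replacement table containing only the rows to keep, then swap it in
CREATE TABLE merged_new LIKE merged;
INSERT INTO merged_new
SELECT m.*
FROM merged AS m
JOIN knownID AS k ON k.ID = m.ID;   -- duplicates rows if knownID.ID is not unique
RENAME TABLE merged TO merged_old, merged_new TO merged;
DROP TABLE merged_old;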
For further discussion, please provide SHOW CREATE TABLE for both tables.

How do I figure out what is causing Data IO spikes on my Azure SQL database?

I have an Azure SQL production database that runs at around 10-20% DTU usage on average; however, I get DTU spikes that take it upwards of 100% at times. Here is a sample from the past hour:
I realize this could be a rogue query, so I switched over to the Query Performance Insight tab, and I find the following from the past 24 hours:
This chart makes sense with regards to the CPU usage line. Query 3780 takes the majority of the CPU, as expected with my application. The Overall DTU (red) line seems to follow this correctly (minus the spikes).
However, in the DTU Components charts I can see large Data IO spikes occurring that coincide with the Overall DTU spikes. Switching over to the TOP 5 queries by Data IO, I see the following:
This seems to indicate that there are no queries that are using high amounts of Data IO.
How do I find out where this Data IO usage is coming from?
Finally, I see that there is one "oddball" query (7966) listed under the TOP 5 queries by Data IO with only 5 executions. Selecting it shows the following:
SELECT StatMan([SC0], [SC1], [SC2], [SB0000])
FROM (SELECT TOP 100 PERCENT [SC0], [SC1], [SC2], step_direction([SC0]) over (order by NULL) AS [SB0000]
FROM (SELECT [UserId] AS [SC0], [Type] AS [SC1], [Id] AS [SC2] FROM [dbo].[Cipher] TABLESAMPLE SYSTEM (1.828756e+000 PERCENT)
WITH (READUNCOMMITTED) ) AS _MS_UPDSTATS_TBL_HELPER
ORDER BY [SC0], [SC1], [SC2], [SB0000] ) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 16)
What is this query?
This does not look like any query that my application has created/uses. The timestamps on the details chart seem to line up with the approximate times of the overall Data IO spikes (just prior to 6am) which leads me to think this query has something to do with all of this.
Are there any other tools can I use to help isolate this issue?
The query is updating statistics. This occurs when the AUTO UPDATE STATISTICS setting is on. It should be kept on and you shouldn't turn it off; this is a best practice.
You should update stats manually only when you see a query not performing well and the stats are off for that query.
Also, below are some rules for when SQL Server will update stats automatically for you:
When a table with no rows gets a row
When 500 rows are changed in a table that has 500 or fewer rows
When 20% of the rows + 500 are changed in a table with more than 500 rows
By 'change' we mean a row being inserted, updated or deleted. So yes, even the automatically-created statistics get updated and maintained as the data changes. There were some changes to these rules in recent versions, and SQL Server can update stats more often.
References:
https://www.sqlskills.com/blogs/erin/understanding-when-statistics-will-automatically-update/
It seems that query is part of the automatic update of statistics process. To mitigate the impact of this process on production you can regularly update statistics and indexes using runbooks as explained here. Run sp_updatestats to immediately try to mitigate the impact of this process.
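For example (dbo.Cipher is simply the table referenced by the StatMan query above):
-- Refresh out-of-date statistics across the whole database with default sampling
EXEC sp_updatestats;
-- Or target just the table being sampled by the StatMan query
UPDATE STATISTICS [dbo].[Cipher];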

SQL Server : wrong index is used when filter value exceeds the index histogram range

We have a very large table to which 1-2 million rows are added every day.
In this query:
SELECT jobid, exitstatus
FROM jobsData
WHERE finishtime >= {ts '2012-10-04 03:19:26'} AND task = 't1_345345_454'
GROUP BY jobid, exitstatus
Indexes exist for both Task and FinishTime.
We expected the Task index to be used, since its predicate matches far fewer rows. The problem we see is that SQL Server creates a bad query execution plan that uses the FinishTime index instead of the Task index, and the query takes a very long time.
This happens when the finish time value lies outside the FinishTime index histogram.
Statistics are updated every day / every several hours, but there are still many cases where the queries are for recent values.
The question: we can see clearly in the estimated execution plan that the estimated number of rows for FinishTime is 1 in this case, so the FinishTime index is selected. Why does SQL Server assume 1 row when there is no data in the histogram for that range? Is there a way to tell it to use something more reasonable?
When we replace the date with a slightly earlier one, statistics exist in the histogram and the estimated number of rows is ~7000.
You can use a Plan Guide to instruct the optimizer to use a specific query plan for you. This fits well for generated queries that you cannot modify to add hints.
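A hedged sketch of what such a plan guide could look like; IX_jobsData_task is a placeholder for your index on the task column, and @stmt must match the application's batch text character for character. Because the literal values change on every execution, in practice you may need a TEMPLATE plan guide combined with forced parameterization instead.
EXEC sp_create_plan_guide
    @name   = N'PG_jobsData_task_seek',
    @stmt   = N'SELECT jobid, exitstatus
FROM jobsData
WHERE finishtime >= {ts ''2012-10-04 03:19:26''} AND task = ''t1_345345_454''
GROUP BY jobid, exitstatus',
    @type   = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints  = N'OPTION (TABLE HINT (jobsData, INDEX (IX_jobsData_task)))';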

Unexpected estimated rows in query execution plan (Sql Server 2000)

if I run this query
select user from largetable where largetable.user = 1155
(note I'm querying user just to reduce this to its simplest case)
And look at the execution plan, an index seek is planned (largetable has an index on user), and the estimated row count is the correct 29.
But if I do
select user from largetable where largetable.user = (select user from users where externalid = 100)
(with the result of the subquery being the single value 1155, just like above when I hard-code it)
The query optimizer estimates 117,000 rows in the result. There are about 6,000,000 rows in largetable, 1700 rows in users. When I run the query of course I get back the correct 29 rows despite the huge estimated rows.
I have updated stats with FULLSCAN on both tables on the relevant indexes, and when I look at the stats, they appear to be correct.
Of note, for any given user, there are no more than 3,000 rows in largetable.
So, why would the estimated execution plan show such a large number of estimated rows? Shouldn't the optimizer know, based on the stats, that it's looking for a result that has 29 corresponding rows, or a MAXIMUM of 3,000 rows, even if it doesn't know which user will be selected by the subquery? Why this huge estimate? The problem is that this large estimate is then influencing another join in a larger query to do a scan instead of a seek. If I run the larger query with the subquery, it takes 1 min 40 secs. If I run it with the 1155 hard-coded, it takes 2 seconds. This is very unusual to me...
Thanks,
Chris
The optimizer does the best it can, but statistics and row count estimations only go so far (as you're seeing).
I'm assuming that your more complex query can't easily be rewritten as a join without a subquery. If it can be, you should attempt that first.
Failing that, it's time for you to use your additional knowledge about the nature of your data to help out the optimizer with hints. Specifically look at the forceseek option in the index hints. Note that this can be bad if your data changes later, so be aware.
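A hypothetical example of the hint (note that FORCESEEK requires SQL Server 2008 or later; on SQL Server 2000 an explicit INDEX hint is the closest equivalent, and ix_largetable_user is a placeholder name):
SELECT lt.[user]
FROM largetable AS lt WITH (FORCESEEK)   -- SQL Server 2008+
-- FROM largetable AS lt WITH (INDEX (ix_largetable_user))   -- SQL Server 2000 alternative
WHERE lt.[user] = (SELECT u.[user] FROM users AS u WHERE u.externalid = 100);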
Did you try this?
SELECT lt.user
FROM Users u
INNER JOIN largeTable lt
ON u.User = lt.User
WHERE u.externalId = 100
Please see this: subqueries-vs-joins

SQL Query Costing, aggregating a view is faster?

I have a table, Sheet1$, that contains 616 records. I have another table, Rates$, that contains 47,880 records. Rates contains a response rate for a given record in the sheet for each of the 90 days from a mailing date. Within all 90 days of a record's Rates relation, the total response is ALWAYS 1 (100%).
Example:
Sheet1$: Record 1, 1000 QTY, 5% Response, Mail 1/1/2009
Rates$: Record 1, Day 1, 2% Response
Record 1, Day 2, 3% Response
Record 1, Day 90, 1% Response
Record N, Day N, N Response
So I've written a view that joins these tables (a right outer join onto the rates) to expand the data, so I can perform some math to get a return per day for any given record.
SELECT s.[Mail Date] + r.Day as Mail_Date, s.Quantity * s.[Expected Response Rate] * r.Response as Pieces, s.[Bounce Back Card], s.Customer, s.[Point of Entry]
FROM Sheet1$ as s
RIGHT OUTER JOIN Rates$ as r
ON s.[Appeal Code] = r.Appeal
WHERE s.[Mail Date] IS NOT NULL
AND s.Quantity <> 0
AND s.[Expected Response Rate] <> 0
AND s.Quantity IS NOT NULL
AND s.[Expected Response Rate] IS NOT NULL;
So I save this as a view called Test_Results. Using SQL Server Management Studio I run this query and get a result of 211,140 records. Elapsed time was 4.121 seconds, Est. Subtree Cost was 0.751.
Now I run a query against this view to aggregate a piece count on each day.
SELECT Mail_Date, SUM(Pieces) AS Piececount
FROM Test_Results
GROUP BY Mail_Date
That returns 773 rows and took only 0.452 seconds to execute, with an Est. Subtree Cost of 1.458.
My question is, with a higher estimated cost, how did this execute SO much faster than the original view itself?! I would assume part of it might be that the first query returns far more rows to Management Studio. If that is the case, how would I go about viewing the true cost of this query without having to account for returning the results to the client?
SELECT * FROM view1 will have a plan
SELECT * FROM view2 (where view2 is based on view1) will have its own complete plan
The optimizer is smart enough to make the plan for view2 combine/collapse the operations into the most efficient operation. It only has to observe the semantics of the design of view1; it is not required to use the plan for SELECT * FROM view1 and then apply another plan for view2 - in general this will be a completely different plan, and it will do whatever it can to get the most efficient results.
Typically, it's going to push the aggregation down to improve the selectivity, and reduce the data requirements, and that's going to speed up the operation.
I think that Cade has covered the most important part - selecting from a view doesn't necessarily entail returning all of the view rows and then selecting against that. SQL Server will optimize the overall query.
To answer your question though, if you want to avoid the network and display costs then you can simply select each query result into a table. Just add "INTO Some_Table" after the column list in the SELECT clause.
You should also be able to separate things out by showing client statistics or by using Profiler, but the SELECT...INTO method is quick and easy.
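For example, something like this (the temp table name is arbitrary):
-- Capture the aggregate into a temp table so that returning rows to the client
-- is excluded from the measured time
SELECT Mail_Date, SUM(Pieces) AS Piececount
INTO #Piececount
FROM Test_Results
GROUP BY Mail_Date;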
Query costs are unitless, and are just used by the optimizer to choose what it thinks the most efficient execution path for a particular query is. They can't really be compared between queries. This, although old, is a good quick read. Then you'll probably want to look around for some books or articles on the MSSQL optimizer and about reading query plans if you're really interested.
(Also, make sure you're viewing the actual execution plan, and not just the estimated plan ... they can be different)
