I have numerous quite large tables (300-400 tables, ~30 million rows each). Every day (once a day) I have to check whether any new rows were inserted into any of these tables. The number of rows inserted may vary from 0 to 30 million. Rows are not going to be deleted.
At the moment, I check whether any new rows were inserted using an approximate count, and then compare it with the previous day's result.
SELECT reltuples FROM pg_class WHERE oid='tablename'::regclass;
The main thing I am unsure about: how soon will reltuples be updated if, for example, 3,000 rows (or just 5 rows) are inserted? And is an approximate count a good solution for this case?
My config parameters are:
autovacuum_analyze_threshold: 50
autovacuum_analyze_scale_factor: 0.1
reltuples will be updated whenever VACUUM (or autovacuum) runs, so this number normally has an error margin of up to 20%.
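To put the settings quoted in the question into perspective (a rough sketch, assuming the usual trigger formula of autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples): for a ~30-million-row table, autoanalyze only fires after roughly 3,000,050 row modifications, so inserting 3,000 (let alone 5) rows will not, by itself, refresh the estimate.

-- Approximate number of modified tuples needed before autoanalyze runs
-- for this table, given autovacuum_analyze_threshold = 50 and
-- autovacuum_analyze_scale_factor = 0.1.
SELECT 50 + 0.1 * reltuples AS autoanalyze_trigger_rows
FROM pg_class
WHERE oid = 'tablename'::regclass;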
You'll get a better estimate for the number of rows in the table from the table statistics view:
SELECT n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'myschema' AND relname = 'mytable';
This number is updated by the statistics collector, so it is not guaranteed to be 100% accurate (there is a UDP socket involved), and it may take a little while for the effects of a data modification to be visible there.
Still, it is often a more accurate estimate than reltuples.
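If it helps, the daily check described in the question can be driven off that view by keeping one snapshot per day in a small tracking table. A minimal sketch, assuming a made-up table name row_count_snapshot:

-- Hypothetical tracking table: one estimate per table per day.
CREATE TABLE IF NOT EXISTS row_count_snapshot (
    snapshot_date date   NOT NULL DEFAULT current_date,
    schemaname    name   NOT NULL,
    relname       name   NOT NULL,
    n_live_tup    bigint NOT NULL,
    PRIMARY KEY (snapshot_date, schemaname, relname)
);

-- Once a day: record the current estimates for every user table.
INSERT INTO row_count_snapshot (schemaname, relname, n_live_tup)
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables;

-- Tables whose estimate grew since yesterday's snapshot.
SELECT t.schemaname, t.relname,
       t.n_live_tup - y.n_live_tup AS estimated_new_rows
FROM row_count_snapshot t
JOIN row_count_snapshot y
  ON y.schemaname = t.schemaname
 AND y.relname = t.relname
 AND y.snapshot_date = t.snapshot_date - 1
WHERE t.snapshot_date = current_date
  AND t.n_live_tup > y.n_live_tup;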
As part of a uni project, I am cleansing some large CSVs with an algorithm, and am using MariaDB 10.5.9 due to the size of the data.
The data has 5 columns: date, time, PlaceID, ID (not unique, values repeat), and Location.
It is a large dataset with approximately 50+ million records per day, totaling 386 million records over the week.
I started to run the algorithm over each day individually and this worked well, the whole process taking between 11 and 15 minutes.
When trying to run it over the 7 days combined, I see a significant impact on performance.
Most elements work, but I have one query which compares values in ID with a list of known good IDs and deletes any that are not in the known-good list.
DELETE quick FROM merged WHERE ID NOT IN (SELECT ID FROM knownID) ;
On a daily table, this query takes around 2 minutes (comparing 50 million rows against 125 million known-good IDs; both tables have indexes on their ID columns to speed up the process).
Table size for merged data is 24.5GB and for known good is 4.7GB
When running across the whole week, I expected it to take around 7 times as long (plus a bit), but the query took just under 2 hours. How can I improve this performance? I am loading both tables into MEMORY tables while performing the work and then copying back to disk-based tables once complete, to try to speed up the process; the server has 256GB RAM, so there is plenty of room. Are there any other settings I can change or tweak?
my.ini is below:
innodb_buffer_pool_size=18G
max_heap_table_size=192G
key_buffer_size=18G
tmp_memory_table_size=64G
Many thanks
innodb_buffer_pool_size=18G -- too low; raise to 200G
max_heap_table_size=192G -- dangerously high; set to 2G
key_buffer_size=18G -- used only by MyISAM; set to 50M
tmp_memory_table_size=64G -- dangerously high; set to 2G
How many rows will be deleted by this?
DELETE quick FROM merged
WHERE ID NOT IN (SELECT ID FROM knownID) ;
Change to the "multi-table" syntax for DELETE and use LEFT JOIN ... IS NULL
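Using the table and column names from the question, the multi-table form would look roughly like this (a sketch, not tested against your schema):

-- Delete merged rows whose ID has no match in knownID.
DELETE merged
FROM merged
LEFT JOIN knownID ON knownID.ID = merged.ID
WHERE knownID.ID IS NULL;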
If you are deleting more than, say, a thousand rows, do it in chunks. See http://mysql.rjweb.org/doc.php/deletebig
As discussed in that link, it may be faster to build a new table with just the rows you want to keep.
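That rebuild approach would look something like this (a sketch; it assumes knownID.ID is unique, and whether it is faster depends on what fraction of the rows survive):

-- Copy only the rows whose ID is known good, then swap the tables.
-- Note: LIKE copies the secondary indexes too; loading may be faster
-- if you add them after the INSERT instead.
CREATE TABLE merged_new LIKE merged;

INSERT INTO merged_new
SELECT m.*
FROM merged AS m
JOIN knownID AS k ON k.ID = m.ID;

RENAME TABLE merged TO merged_old, merged_new TO merged;
DROP TABLE merged_old;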
DELETE must keep the old rows until the end of the statement; then (in the background) do the actual delete. This is a lot of overhead.
For further discussion, please provide SHOW CREATE TABLE for both tables.
May I ask whether there is a way to determine the appropriate timing for updating table/index statistics?
Recently, performance has been getting worse with one of the major data mart tables in our BI-DWH, on SQL Server 2012.
All indexes are reorganized/rebuilt every weekend according to their fragmentation percentage, and they are now all under 5% avg_fragmentation_in_percent.
So we suspect the slowdown is caused by obsolete table/index statistics, table fragmentation, or something similar.
We have auto-update statistics on, and the table/index stats were last updated in July 2018; maybe it is still not time to update according to the optimizer,
since that table is huge: around 0.7 billion records in total, growing by about 0.5 million records daily.
Here is PK statistics and actual record count of that table.
-- statistics
dbcc show_statistics("DM1","PK_DM1")
Name    Updated              Rows       Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression  Unfiltered Rows
------  -------------------  ---------  ------------  -----  -------  ------------------  ------------  -----------------  ---------------
PK_DM1  Jul  6 2018  2:54PM  661696443  1137887       101    0        28                  NO            NULL               661696443
-- actual row count
select count(*) row_cnt from DM1;
row_cnt
-------------
706723646
-- Current Index Fragmentations
SELECT a.index_id, name, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (DB_ID(N'DM1'),
OBJECT_ID(N'dbo.DM1'), NULL, NULL, NULL) AS a
JOIN sys.indexes AS b
ON a.object_id = b.object_id AND a.index_id = b.index_id;
GO
index_id  name        avg_fragmentation_in_percent
--------  ----------  ----------------------------
1         PK_DM1      1.32592173128252
7         IDX_DM1_01  1.06209021193359
9         IDX_DM1_02  0.450888386865285
10        IDX_DM1_03  4.78448190118396
So the difference is less than 10%, but it is over 45 million rows between the statistics row count and the actual record count.
I'm wondering whether it would be worth updating the table/index stats manually in this case.
If there is any other information you use to decide the appropriate timing for updating stats, any advice would be much appreciated.
Thank you.
-- Result
Thanks to scsimon's advice, I checked all index statistics in detail and found that the main index was missing RANGE_HI_KEY values -- that index is based on registration date, and the histogram had no ranges after the July 2018 statistics update.
(The complaint was made by a user when he searched for September 2018 records.)
So I decided to update the table/index statistics, and confirmed that the same query improved from 1 hour 45 minutes to 3.5 minutes.
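For reference, the manual refresh amounts to something like this (a sketch; the FULLSCAN sampling is an assumption, not necessarily what was used here):

-- Refresh all statistics on the table; FULLSCAN reads every row instead of sampling.
UPDATE STATISTICS dbo.DM1 WITH FULLSCAN;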
Deeply appreciate all of the advice on my question.
Best Regards.
Well, you have auto-update statistics set to on, so that's good. Also, each time an index is rebuilt, the statistics are recomputed. From SQL Server 2008 R2 onward you can enable trace flag 2371 (its behavior is the default starting with 2016), which means a large table needs proportionally fewer changed rows before statistics are auto-updated. Read more here on that.
Also, you are showing stats for a single index, not the whole table, and that index could be filtered. And remember that Rows Sampled is the total number of rows sampled for the statistics calculations; if Rows Sampled < Rows, the displayed histogram and density results are estimates based on the sampled rows. You can read more on that here.
Back to the core problem of performance... you are focusing on statistics and the indexes, which isn't a terrible idea, but it's not necessarily the root cause. You need to identify which query is running slow. Then get help with that slow query by following the steps in that blog, and others. The big one here is to ask a question about that query with the execution plan. The problem could be the indexes, or it could be:
Memory contention / misallocation
CPU bottleneck
Parallelism (maybe you have MAXDOP set to 0)
Slow disks
Little memory, causing physical reads
The execution plan isn't optimal anymore and perhaps you need to recompile that query
etc, etc etc... this is where the execution plan and wait stats will shed light
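As a practical aid for deciding when a manual statistics update is due, the modification counters give a direct answer. A sketch against the dbo.DM1 table from the question (sys.dm_db_stats_properties is available on SQL Server 2008 R2 SP2 / 2012 SP1 and later):

-- Last update time and rows modified since then, for every statistic on dbo.DM1.
SELECT s.name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.DM1');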
I have a table t1 with a primary key on the identity column of this table. Within the insert trigger of t1, an UPDATE statement is being issued to update some columns of the newly inserted rows, the join being on t1.primarykeycolumn and inserted.primarykeycolumn.
When the number of inserted rows starts to creep up, I have noticed 'suboptimal' execution plans. I guess the optimizer is referring to the statistics on t1 to arrive at the execution plan. But for newly inserted rows, the statistics page is always going to be stale; after all, the IDENTITY column is always going to be monotonically increasing when SQL Server is supplying the values.
To prove that statistics are the 'issue', I issued an UPDATE STATISTICS command as the first statement in the trigger and the optimizer is able to come up with a very good plan for a wide variety of rows. But I certainly cannot issue UPDATE STATISTICS in production code for a mostly OLTP system.
Most of the time, the number of rows inserted will be a few tens, and only occasionally a couple of thousand. When the number of rows is in the tens, the execution plan shows only nested loop joins, while it switches to a series of Merge Joins + Stream Aggregates at some point as the number of rows creeps up.
I want to avoid writing convoluted code within the trigger, one part for handling large number of rows and the other for the smaller number of rows. After all, this is what the server is best at doing. Is there a way to tell the optimizer 'even though you do not see the inserted values in the statistics, the distribution is going to be exactly like those that have been inserted before. please come up with the plan based on this assumption'? Any pointers appreciated.
After experimenting a bit, I have the following observation :
In the absence of statistics, the optimizer comes up with optimal plans for a very wide range of row counts. It is only when the statistics are updated before issuing the inserts (i.e., when statistics are available) that the optimizer comes up with 'bad' plans on the join between inserted and the base table inside the trigger as the number of rows goes up.
Is there a way to tell the optimizer 'ignore whatever is on the statistics page, go do whatever you were doing in the absence of statistics"?
In this specific case, INSTEAD OF INSERT triggers are a viable option. See http://www.sqlservercentral.com/Forums/Topic1826794-3387-1.aspx
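The idea is that an INSTEAD OF INSERT trigger lets you compute the extra columns before the rows land in t1, so the follow-up UPDATE (and its badly estimated join against inserted) goes away entirely. A minimal sketch, with made-up column names since the real schema isn't shown:

-- Hypothetical schema: t1(id IDENTITY PK, col1, col2, derived_col)
CREATE TRIGGER trg_t1_instead_of_insert
ON dbo.t1
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Perform the real insert ourselves, filling in the derived column up front
    -- instead of updating t1 afterwards.
    INSERT INTO dbo.t1 (col1, col2, derived_col)
    SELECT i.col1,
           i.col2,
           i.col1 + i.col2      -- stands in for the real computation
    FROM inserted AS i;
END;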
Is it more efficient and ultimately FASTER to delete rows from a DB in blocks of 1000 or 10000? I am having to remove approx 3 million rows from many tables. I first did the deletes in blocks of 100K rows but the performance wasn't looking good. I changed to 10000 and seem to be removing faster. Wondering if even smaller like 1K per DELETE statement is even better.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server, though. The last time I did that, I was using this approach to delete things in 64-million-row increments (on a table that had around 14 billion rows at that point, 80% of which ultimately got deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going more granular is more work, but it means less waiting on the transaction log for other things operating on the table. You have to experiment and find where you are comfortable - there is no ultimate answer, because it is totally dependent on the usage of the table and the hardware.
We used Table Partitioning to remove 5 million rows in less than a sec but this was from just one table. It took some work up-front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
Data CANNOT overlap current data. You have to partition the data on a value. The new data cannot be intertwined within the currently partitioned data. If removing data, you have to remove an entire partition or partitions. You will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete is all on partitions by itself (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions with the data you want to delete to the new table.
Un-partition the old table if you wish.
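The partition swap itself is a metadata-only operation, which is why it finishes in under a second regardless of row counts. A sketch of what it looks like (the partition number, table names, and partition function name are placeholders):

-- Move everything in partition 2 of dbo.BigTable into dbo.BigTable_ToDelete,
-- which must already exist, be empty, have an identical structure, and live
-- on the same filegroup. Then drop the staging table.
ALTER TABLE dbo.BigTable
SWITCH PARTITION 2 TO dbo.BigTable_ToDelete;

DROP TABLE dbo.BigTable_ToDelete;

-- Optionally merge the now-empty range into a neighbouring partition.
ALTER PARTITION FUNCTION pf_BigTable()
MERGE RANGE ('20120101');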
Yes and no; it depends on the usage of the table, due to locking. I would try to delete the records at a slower pace, so the opposite of what the OP asks.
SET ROWCOUNT 10000;          -- limit each DELETE to 10,000 rows

DECLARE @deleted int = 1;

WHILE @deleted > 0
BEGIN
    WAITFOR DELAY '0:0:1';   -- pause between batches to reduce contention

    DELETE
    FROM [TABLE]             -- placeholder table name from the question
    WHERE Date < CONVERT(datetime, '20120101', 112);

    SET @deleted = @@ROWCOUNT;   -- rows removed by this batch
END

SET ROWCOUNT 0;              -- restore the default
if I run this query
select user from largetable where largetable.user = 1155
(note I'm querying user just to reduce this to its simplest case)
And look at the execution plan, an index seek is planned [largetable has an index on user], and the estimated row count is the correct 29.
But if I do
select user from largetable where largetable.user = (select user from users where externalid = 100)
[with the result of the subquery being the single value 1155, just like above when I hard-code it]
The query optimizer estimates 117,000 rows in the result. There are about 6,000,000 rows in largetable, 1700 rows in users. When I run the query of course I get back the correct 29 rows despite the huge estimated rows.
I have updated stats with fullscan on both tables on the relevant indexes, and when I look at the stats, they appear to be correct.
Of note, for any given user, there are no more than 3,000 rows in largetable.
So, why would the estimated execution plan show such a large number of estimated rows? Shouldn't the optimizer know, based on the stats, that it's looking for a result that has 29 corresponding rows, or a MAXIMUM of 3,000 rows, even if it doesn't know which user will be selected by the subquery? Why this huge estimate? The problem is that this large estimate then influences another join in a larger query to do a scan instead of a seek. If I run the larger query with the subquery, it takes 1 min 40 secs. If I run it with 1155 hard-coded, it takes 2 seconds. This is very unusual to me...
Thanks,
Chris
The optimizer does the best it can, but statistics and row count estimations only go so far (as you're seeing).
I'm assuming that your more complex query can't easily be rewritten as a join without a subquery. If it can be, you should attempt that first.
Failing that, it's time for you to use your additional knowledge about the nature of your data to help out the optimizer with hints. Specifically look at the forceseek option in the index hints. Note that this can be bad if your data changes later, so be aware.
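If you go the hint route, it would look something like this (a sketch; whether FORCESEEK actually helps depends on the rest of the larger query):

SELECT lt.[user]
FROM largetable AS lt WITH (FORCESEEK)
WHERE lt.[user] = (SELECT u.[user] FROM users AS u WHERE u.externalid = 100);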
Did you try this?
SELECT lt.user
FROM Users u
INNER JOIN largeTable lt
ON u.User = lt.User
WHERE u.externalId = 100
Please see this: subqueries-vs-joins