SQL statistics on large databases

What is a good way to handle statistics on SQL Server 2008 for very large databases? There are multiple tables with 100m+ rows each.
Should auto update statistics be on? Will auto update statistics async help at all? Should a job be set up to manually update statistics on some kind of schedule?
Usually data is added to the table but older data isn't changed very often.
Update: About 100k rows inserted each hour. Mostly reporting is done on the data. Updates can happen on 1-2 columns on ~500k rows per day.

For one, I would not want an automatic stats update to run in the middle of the day on a large table, so I would say no. Also, you need to hit the change threshold (20% of the rows, I believe) before it kicks in anyway.
Now, if you already have a job that rebuilds the indexes, then the stats on those indexes are updated automatically as part of the rebuild (this is not true of a reorganize/defrag).
Also, 100 million rows by itself doesn't mean much. How many columns are there? A table that is 12 bytes wide per row is very different from one that is 4100 bytes wide, especially since at 4100 bytes per row you can only fit 1 row per page.

What is a good way to handle statistics on SQL 2008 for very large databases? Multiple tables with 100m+ rows in each.
PLEASE don't call this very large. Let me give you an example of very large: we just ran a SQL statement on some data in our warehouse, and temp space usage topped out at 180 GB for that one statement. The database? Double-digit terabytes. 100m+ rows is not small, but it is not very large either.
Should auto update statistics be on? Will auto update statistics async help at all? Should a job be set up to manually update statistics on some kind of schedule?
It depends on update and usage patterns.
Usually data is added to the table but older data isn't changed very often.
How often? How much, as a percentage? What data? Do the statistics go out of date quickly, or do they drift slowly? You have to provide a LOT more information before anyone can make sensible suggestions.

Should auto update statistics be on?
It depends...
Will auto update statistics async help at all?
It will help prevent a long-running stats update from killing a query. Basically, this tells SQL Server that when a query comes in and the stats turn out to be outdated, it should not hold the query, update the stats, and then run the query. Instead, it should just run the query and update the stats behind the scenes. So the particular query that triggered the stats update won't get any benefit from the new stats, but it also won't sit around waiting for the update to finish.
Should a job be set up to manually update statistics on some kind of schedule?
Yes! Stats are only updated automatically once about 20% of a table's data has been "changed". On very big tables that can basically be the same as saying stats will never be updated. If you have any large tables where new data is being added, you should always have a scheduled process to update stats on them.
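To put rough numbers on that threshold, here is a small Python sketch. It assumes the SQL Server 2008 rule for tables with more than 500 rows, where an automatic update is triggered after about 500 + 20% of the rows have been modified, and it plugs in the figures from the question (100m rows, about 100k inserts per hour, about 500k updated rows per day). The function names are invented for illustration, and the table size is treated as fixed between updates, which is a simplification.

```python
import math

def auto_update_threshold(row_count):
    """Modifications needed before an automatic stats update is triggered.

    Assumption: the 500 + 20% rule for tables with more than 500 rows.
    """
    return 500 + row_count * 20 // 100

def days_until_auto_update(row_count, changes_per_day):
    """Whole days of changes needed to cross the threshold."""
    return math.ceil(auto_update_threshold(row_count) / changes_per_day)

rows = 100_000_000                       # table size from the question
daily_changes = 100_000 * 24 + 500_000   # hourly inserts plus daily updates

threshold = auto_update_threshold(rows)
days = days_until_auto_update(rows, daily_changes)
print(f"threshold: {threshold:,} modifications, reached after about {days} days")
```

Under these assumptions the automatic update would fire only about once a week, and less often as the table grows, which is why a scheduled job is suggested for tables like this.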

"It depends" is a good answer but in the absence of reproducable and measurable improvement I'd leave it at the default.
If you update stats manually overnight then you have less chance of an auto-update kicking in. And you can defer a stats update by setting AUTO_UPDATE_STATISTICS_ASYNC (See "When to Use Synchronous or Asynchronous Statistics Updates")
On balance I wouldn't disable it or change the default which is "on".

Related

MonetDB refresh data in background best strategy with active connections making queries

I'm testing MonetDB and getting an amazing performance while querying millions of rows on my laptop.
I expect to work with billions in production and I need to update the data as often as possible, let's say every minute, or every 5 minutes in the worst case. That means just updating existing records or adding new ones; deletion can be scheduled once a day.
I've seen good performance for updates in my tests, but I'm a bit worried about the same operations over three or four times more data.
As for BULK insert, I got 1 million rows in 5 seconds, so performance is good enough right now as well. I have not tried deletion.
Everything works fine unless you run queries at the same time as you update the data; in that case everything seems to freeze for a long, long time.
So, what's the best strategy to get MonetDB updated in background?
Thanks

You could do each load in a new table with the same schema, then create a VIEW that unions them all together. Queries will run on the view, and dropping and recreating that view is very fast.
However, it would probably be best to merge some of these smaller tables together every now and then. For example, a nightly job could combine all load tables from the previous day(s) into a new table (runs independently, no problem) and then recreate the view again.
Alternatively, you could use binary bulk loading (COPY BINARY INTO) to speed up the loading process in the first place.
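The load-table-plus-view pattern from this answer can be sketched as follows. This is only an illustration of the idea: it uses Python's built-in sqlite3 module as a stand-in for MonetDB, and the table names, view name, and two-column schema are made up.

```python
import sqlite3

def rebuild_view(conn, view_name, table_names):
    """Drop and recreate a view that unions all load tables."""
    selects = " UNION ALL ".join(f"SELECT * FROM {t}" for t in table_names)
    conn.execute(f"DROP VIEW IF EXISTS {view_name}")
    conn.execute(f"CREATE VIEW {view_name} AS {selects}")

def add_load(conn, table_names, rows):
    """Load a batch into its own new table and return that table's name."""
    name = f"load_{len(table_names) + 1}"  # hypothetical naming scheme
    conn.execute(f"CREATE TABLE {name} (id INTEGER, value TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
    table_names.append(name)
    return name

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MonetDB
tables = []

# First load: queries only ever touch the view, never the load tables.
add_load(conn, tables, [(1, "a"), (2, "b")])
rebuild_view(conn, "all_loads", tables)

# Second load goes into a fresh table; recreating the view publishes it.
add_load(conn, tables, [(3, "c")])
rebuild_view(conn, "all_loads", tables)

total = conn.execute("SELECT COUNT(*) FROM all_loads").fetchone()[0]
print(total)
```

The slow part (loading) happens in a table that no query reads yet, and the only step that touches what queries see is the quick drop-and-recreate of the view.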

There is a new merge table functionality that could replace the view in Hannes Mühleisen's answer and would be more idiomatic.
You can attach and detach partitions using:
ALTER TABLE mergedTable ADD TABLE partitionTable;
ALTER TABLE mergedTable DROP TABLE partitionTable;
Updates remain awkward, because they must be made directly against the partition tables. That is easier if you have a partitioning key (a date, for example).
But the same was true of the previous solution.

Choosing table design for database performance

I am developing a job application that executes multiple jobs in parallel. Every job pulls data from a third-party source and processes it, with a minimum of 100,000 records per job. So I am creating a new table for each job (like Job123, where 123 is the job ID) and processing the data there. When a job starts, it clears the old records, gets new records, and processes them. Now the problem is that I have 1000 jobs, so the DB has 1000 tables, and the DB size has increased drastically because of all these tables.
My question is whether it is OK to create a new table for each job, or whether I should have only one table called Job with a jobId column, and enter and process the data there. The only concern is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.

Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.

I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
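A minimal sketch of the single-table design, using Python's built-in sqlite3 as a stand-in for SQL Server. The table and column names are hypothetical. SQLite's WITHOUT ROWID tables are stored in primary-key order, which is used here only as a rough analogue of a clustered index that starts with the job ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not SQL Server

# WITHOUT ROWID stores rows in a b-tree ordered by the primary key, so the
# rows of one job sit together, similar to a clustered index led by job_id.
conn.execute(
    """
    CREATE TABLE job_record (
        job_id    INTEGER NOT NULL,
        record_id INTEGER NOT NULL,
        payload   TEXT,
        PRIMARY KEY (job_id, record_id)
    ) WITHOUT ROWID
    """
)

def reload_job(conn, job_id, payloads):
    """Clear a job's old records and load its new ones in one transaction."""
    with conn:
        conn.execute("DELETE FROM job_record WHERE job_id = ?", (job_id,))
        conn.executemany(
            "INSERT INTO job_record (job_id, record_id, payload) VALUES (?, ?, ?)",
            [(job_id, i, p) for i, p in enumerate(payloads)],
        )

def count_for_job(conn, job_id):
    """Count the records that currently belong to one job."""
    return conn.execute(
        "SELECT COUNT(*) FROM job_record WHERE job_id = ?", (job_id,)
    ).fetchone()[0]

reload_job(conn, 123, ["a", "b", "c"])
reload_job(conn, 456, ["x", "y"])
reload_job(conn, 123, ["d"])  # a rerun of job 123 replaces only its own rows

print(count_for_job(conn, 123), count_for_job(conn, 456))
```

Each job still clears and reloads only its own rows, as in the one-table-per-job design, but there is a single schema and a single index to maintain.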

MS-SQL Server 2000 slow full text indexing

We have a full text index on a fairly large table of 633,569 records. The index is rebuilt from scratch as part of a maintenance plan every evening, after a bunch of DTS packages run that delete / insert records. Large chunks of data are deleted, then inserted (to take care of updates and inserts), so incremental indexing is not a possibility. Changing the packages to only delete when necessary is not a possibility either as it is a legacy application that will eventually be replaced.
The FTI includes two columns - one a varchar(50) not null and a varchar(255) null.
There is a clustered index on the primary key column, which is just an identity column. There is also a combined index on an integer column and the varchar(50) column mentioned above. This latter index was added for performance reasons.
The problem is that the re-indexing is painfully slow - about 8 hours.
The server is fairly robust (dual processor, 4 GB of RAM), and everything other than this re-indexing runs quickly.
Any tips on how to speed this up?
UPDATE
Our client has access to the sql box. Turns out they turned on change tracking on the table that is part of the full text index. We turned this off, and the full population took less than 3 hours. Still not great, but better than 8.
UPDATE 2
The FTI is again taking ~8 hours to populate.

SQL Server's indexing is slow primarily because of its asynchronous data extraction scheme.
Use change tracking with the "update index in background" option
The easiest way to improve the performance of full-text indexing is to use change tracking with the "update index in background" option. When you index a table (FTI, like "standard" SQL indexes, works on a per-table basis), you specify full population, incremental population, or change tracking. When you opt for full population, every row in the table you're full-text indexing is extracted and indexed. This is a two-step process.
First, you (or Enterprise Manager) run this system stored procedure:
sp_fulltext_getdata CatalogID, object_id
After all the result sets of all of the timestamps and PK values are returned to MSSearch, MSSearch will issue another sp_fulltext_getdata, but this time once for every row in your table. So if you have 50 million rows in your table, this procedure will be issued 50 million times.
On the other hand, if you use an incremental population, MSSearch will issue an initial:
sp_fulltext_getdata CatalogID, object_id
for each row in the table that you're full-text indexing. So if you have 50 million rows in your database, this statement will also be issued 50 million times. Why? Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted. Another problem with incremental populations is that they'll index or re-index a row even if the change was made to a column that you aren't full-text indexing.
Although an incremental population is generally faster than a full population, you can see that for large tables, either will be time-consuming.
I recommend you enable change tracking with background or scheduled updating. If you do, you'll see that MSSearch will first issue another:
sp_fulltext_getdata CatalogID, object_id
for every row in the table with change tracking enabled. Then, for every row that has a full-text indexed column modified after your initial full population, the row information will be written (in the database you're indexing) to the sysfulltextnotify table. MSSearch will then request data only for the rows that appear in this table, and will then remove them from the sysfulltextnotify table.
Consider using a separate build server
Tables that are heavily updated while you're indexing can create locking problems, so if you can live with a catalog that's periodically out of date and an MSSearch engine that's sometimes unavailable, consider using a separate build server. You do this by making sure the indexing server has a copy of the table to be full-text indexed, and then exporting the catalog. Clearly, if you need real-time or near real-time updates to your catalog, this is not a good solution.
Limit activity when population is running
When population is running, don't run Profiler, and limit other database activity as much as possible. Profiler consumes significant resources.
Increase the number of threads for the indexing process
Increase the number of threads you're running for the indexing process. The default is only five, and on quads or 8-ways, you can bump this up to much higher values. MSSearch will, however, throttle itself if it's slurping too much data from SQL Server, so avoid doing this on single- or dual-processor systems.
Stop any anti-virus or open file-agent backup software
If this is not possible, try to prevent them from scanning the temporary directories being used by SQL FTI and the catalog directories.
Place the catalog, temp directory and pagefiles on their own controllers
If you can make that investment, place the catalog on its own controller, preferably on a RAID-1 array. Place the temp directory on a RAID-1 array. Similarly, consider putting the pagefile on its own RAID-1 array with its own controller.
Consider creating secondary data files for the Temp DB - 1 per CPU / Core.

Do you have enough RAM?
What are your file drive placements in terms of RAID configuration?
Are you seeing high tempDB activity?
(BTW, half a million records is not large; it's not even medium... ;) )

Is the system offline while you are doing the reindex, or live?
Are these the only items in your full-text catalog? If not, you might want to consider separating them from the rest of your FTS data (this might help with monitoring too). In the index, is the identity column configured as the unique key?
Can you quantify the large amounts of changes? There are 3 basic options for repopulation, and you might want to try switching to full or incremental, as one may suit you better than the one you are using now. In my experience, incremental works well if changes affect less than 40% of the total DB (I had a similar issue during large data take-ons into the database). If more than 40% changes, then full is likely better (again from my experience; I index documents, so it might work differently for you). The third option you might want to try is change tracking with scheduled updates.
If you can take the server offline to users, then what performance settings do you have FTS running under while reindexing? You can check this under Full-Text Search Service Properties / Performance tab, where System Resource Usage is a slider (I think there are 4 or 5 positions). There is probably a system proc to change this, but I don't know it and don't have a 2000 machine to check anymore.
FTS reindexing loves RAM, and lots of it. The general rule of thumb is to have virtual memory set to 3x the physical memory. If you have several physical disks, then create several Pagefile.sys files, so that each one is placed on its own physical disk. Are you on NT or Windows 2000? Check that extended memory over 2 GB is actually configured properly.

Try putting the index on a different physical disk from the database.
EDIT: Scott reports this is already the case.

Disallowing nulls in the column that currently allows them might not speed up the index, but in my experience it is a better practice, especially for indexing purposes. The only columns I can justify allowing nulls in are date columns.

Here is a checklist of parameters for FT-indexing performance on SQL Server. Most of them have already been mentioned and checked here. One that I don't see in your comments, though:
The SQL Server MAX SERVER MEMORY setting should be set manually (that is, dynamic memory allocation is turned off) so that enough virtual memory is left for the Full-Text Search service to run. To achieve this, select a MAX SERVER MEMORY setting that leaves the Full-Text Search service able to access an amount of virtual memory equal to 1.5 times the amount of physical RAM in the server. Finding the right value will take some trial and error.

Improve the Performance of Full-Text Indexes: http://msdn.microsoft.com/en-us/library/ms142560.aspx

What are the implications of having out of date table statistics on a Sybase/SQLServer database?

For example, for heavily used tables with volumes in the order of 10 million rows that grow by a million rows a month, if the stats are 6-8 months old how detrimental to the performance of the database is this going to be? How often should you be refreshing the stats?

Statistics are kept and used by the query planner, and they have a noticeable impact. I can't give you exact guidelines on how often you should refresh them. That will depend on how much work it takes to refresh them and how much impact fresh stats have on your queries. The real answer for this is to take good measurements and judge options by the results. Tinkering without measurement is a throw of the dice.

We refresh stats every night. There is no sense waiting for the weekend if the stats could be refreshed nightly - by Friday they will be worse than they were on Monday...
The problem is: what if it takes too long?
For databases which have that problem, we refresh stats on only certain tables each night, so some tables are done every night and some less often. (We have a database table recording which tables to do when, plus a history of how long the stats took to regenerate, and we tune the schedule accordingly.)
if the stats are 6-8 months old how detrimental to the performance of the database is this going to be
I would be very surprised if it didn't make a huge difference on a table growing by 1 million rows per month.
If that is your actual state, I would expect that the tables need defragging too.
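The idea of keeping a table of which tables to refresh, plus a history of how long each refresh takes, can be sketched as a small scheduler. This is only one possible policy (stalest tables first, within a nightly time budget), not necessarily the one used in this answer, and the table names, durations, and budget below are invented.

```python
def pick_tables(candidates, budget_minutes):
    """Choose tables to refresh tonight.

    candidates: list of (table_name, days_since_refresh, avg_minutes),
    where avg_minutes would come from the recorded refresh history.
    Hypothetical policy: stalest first, skipping any table that no
    longer fits in the remaining time budget.
    """
    chosen = []
    used = 0.0
    for name, _days_stale, minutes in sorted(
        candidates, key=lambda c: c[1], reverse=True
    ):
        if used + minutes <= budget_minutes:
            chosen.append(name)
            used += minutes
    return chosen

history = [
    ("orders", 1, 30),      # refreshed last night, takes about 30 minutes
    ("audit_log", 7, 90),   # a week stale, takes about 90 minutes
    ("customers", 3, 20),
]

tonight = pick_tables(history, budget_minutes=120)
print(tonight)
```

Tables that are skipped tonight become staler and therefore move up the list on the following nights.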

Implications are dire. You should be refreshing them as often as you can, to give the optimizer the best information for its decisions. You can find out how bad the statistics are by running the optdiag utility. Analysing the output, then running it again and comparing over a few days or a week, will let you know exactly how bad the situation is. I would recommend that at the earliest convenience you drop and recreate the indexes and run 'update index statistics' on the table in question. This should be enough to get you through. I am assuming, though, that you are able to analyse the output of optdiag.

How often should Oracle database statistics be run?

In your experience, how often should Oracle database statistics be run? Our team of developers recently discovered that statistics hadn't been run on our production box in over 2 1/2 months. That sounds like a long time to me, but I'm not a DBA.

Since Oracle 11g, statistics are gathered automatically by default.
Two Scheduler windows are predefined upon installation of Oracle Database:
WEEKNIGHT_WINDOW starts at 10 p.m. and ends at 6 a.m. every Monday through Friday.
WEEKEND_WINDOW covers whole days Saturday and Sunday.
When were statistics last gathered?
SELECT owner, table_name, last_analyzed FROM all_tables ORDER BY last_analyzed DESC NULLS LAST; --Tables.
SELECT owner, index_name, last_analyzed FROM all_indexes ORDER BY last_analyzed DESC NULLS LAST; -- Indexes.
Status of automated statistics gathering?
SELECT * FROM dba_autotask_client WHERE client_name = 'auto optimizer stats collection';
Windows Groups?
SELECT window_group_name, window_name FROM dba_scheduler_wingroup_members;
Window Schedules?
SELECT window_name, start_time, duration FROM dba_autotask_schedule;
Manually gather Database Statistics in this Schema:
EXEC dbms_stats.gather_schema_stats(ownname=>NULL, cascade=>TRUE); -- cascade=>TRUE means include Table Indexes too.
Manually gather Database Statistics in all Schemas!
-- Probably need to CONNECT / AS SYSDBA
EXEC dbms_stats.gather_database_stats;

Whenever the data changes "significantly".
If a table goes from 1 row to 200 rows, that's a significant change. When a table goes from 100,000 rows to 150,000 rows, that's not a terribly significant change. When a table goes from 1000 rows all with identical values in commonly-queried column X to 1000 rows with nearly unique values in column X, that's a significant change.
Statistics store information about item counts and relative frequencies -- things that let the optimizer "guess" at how many rows will match given criteria. When it guesses wrong, the optimizer can pick a very suboptimal query plan.

At my last job we ran statistics once a week. If I remember correctly, we scheduled them on a Thursday night, and on Friday the DBAs were very careful to monitor the longest running queries for anything unexpected. (Friday was picked because it was often just after a code release, and tended to be a fairly low traffic day.) When they saw a bad query they would find a better query plan and save that one so it wouldn't change again unexpectedly. (Oracle has tools to do this for you automatically, you tell it the query to optimize and it does.)
Many organizations avoid running statistics out of fear of bad query plans popping up unexpectedly. But this usually means that their query plans get worse and worse over time. And when they do run statistics then they encounter a number of problems. The resulting scramble to fix those issues confirms their fears about the dangers of running statistics. But if they ran statistics regularly, used the monitoring tools as they are supposed to, and fixed issues as they came up then they would have fewer headaches, and they wouldn't encounter them all at once.

What Oracle version are you using? Check this page which refers to Oracle 10:
http://www.acs.ilstu.edu/docs/Oracle/server.101/b10752/stats.htm
It says:
The recommended approach to gathering statistics is to allow Oracle to automatically gather the statistics. Oracle gathers statistics on all database objects automatically and maintains those statistics in a regularly-scheduled maintenance job.

When I was managing a large multi-user planning system backed by Oracle, our DBA had a weekly job that gathered statistics. Also, when we rolled out a significant change that could affect or be affected by statistics, we would force the job to run out of cycle to get things caught up.

With 10g and higher versions of Oracle, up-to-date statistics on tables and indexes are needed by the optimizer to make "good" execution plan decisions. How often you collect statistics is a tricky call. It depends on your application, schema, data rate, and business practice. Some third-party apps that are written to be backward compatible with older versions of Oracle do not perform well with the new optimizer. Those applications require that tables have no stats, so that the DB falls back to rule-based execution plans. But in general, Oracle recommends that stats be collected on tables with stale statistics. You can set tables to be monitored, check their state, and have them analyzed if and when they are stale. Often that is enough; sometimes it is not. It really depends on your database. For my database, we have a set of OLTP tables that need nightly stats collection to maintain performance. Other tables are analyzed once a week. On our large DW database, we analyze as needed, because the tables are too large for regular analysis without affecting overall DB load and performance. So the correct answer is: it depends on the application, data change, and business needs.

Make sure to balance the risk that fresh statistics cause undesirable changes to query plans against the risk that stale statistics can themselves cause query plans to change.
Imagine you have a bug database with a table ISSUE and a column CREATE_DATE where the values in the column increase more or less monotonically. Now, assume that there is a histogram on this column that tells Oracle that the values for this column are uniformly distributed between January 1, 2008 and September 17, 2008. This makes it possible for the optimizer to reasonably estimate the number of rows that would be returned if you were looking for all issues created last week (i.e. September 7 - 13). If the application continues to be used and the statistics are never updated, though, this histogram will be less and less accurate. So the optimizer's estimates for queries on "issues created last week" will be less and less accurate over time, which may eventually cause Oracle to change the query plan for the worse.
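As a rough illustration of this effect, the following Python sketch models a stale histogram in the simplest possible way: it assumes values are uniformly distributed between the recorded minimum and maximum and that nothing exists outside that range. The table size of 26,100 rows, the December query window, and the function name are made up for the example, and Oracle's real estimator is more sophisticated than this.

```python
from datetime import date

def estimate_rows(total_rows, hist_min, hist_max, query_start, query_end):
    """Estimate matching rows, assuming a uniform spread on [hist_min, hist_max].

    All bounds are inclusive dates. This is a simplified model and an
    assumption for illustration, not Oracle's actual algorithm.
    """
    span_days = (hist_max - hist_min).days + 1
    overlap_start = max(hist_min, query_start)
    overlap_end = min(hist_max, query_end)
    overlap_days = max(0, (overlap_end - overlap_start).days + 1)
    return total_rows * overlap_days / span_days

# Statistics as gathered on September 17, 2008 (row count is hypothetical).
total_rows = 26_100
hist_min, hist_max = date(2008, 1, 1), date(2008, 9, 17)

# "Issues created last week" while the statistics are still fresh.
fresh = estimate_rows(total_rows, hist_min, hist_max,
                      date(2008, 9, 7), date(2008, 9, 13))

# The same question asked in December against the same, now stale, statistics.
stale = estimate_rows(total_rows, hist_min, hist_max,
                      date(2008, 12, 7), date(2008, 12, 13))

print(fresh, stale)
```

In this model the fresh estimate is 700 rows, while the December query falls entirely outside the recorded range and is estimated at 0 rows, even though the real table would contain a full week of new issues.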

In the case of a data warehouse-type system you can consider collecting no statistics at all, and relying on dynamic sampling (setting optimizer_dynamic_sampling to level 2 or above).

Generally it's not recommended to gather statistics very frequently on the whole database unless you have a strong justification for it, such as bulk inserts or big data changes happening frequently on the database.
Gathering statistics on the database at this frequency MAY change the queries' execution plans to new, poorer ones, and it may cost you a lot of time to tune every affected query. This is why you should test the impact of gathering new statistics on a test database. If you don't have the time or the manpower for that, you should at least keep a fallback plan by backing up the original statistics before you gather new ones, so that if queries don't perform as expected afterwards, you can easily restore the original statistics.
There is a very useful script that can help you back up the original statistics, gather new ones, and give you the SQL command to restore the original statistics in case things don't go as expected. You can find the script at this link:
http://dba-tips.blogspot.com/2014/09/script-to-ease-gathering-statistics-on.html
