In your experience, how often should Oracle database statistics be run? Our team of developers recently discovered that statistics hadn't been run our production box in over 2 1/2 months. That sounds like a long time to me, but I'm not a DBA.
Since Oracle 11g statistics are gathered automatically by default.
Two Scheduler windows are predefined upon installation of Oracle Database:
WEEKNIGHT_WINDOW starts at 10 p.m. and ends at 6 a.m. every Monday
through Friday.
WEEKEND_WINDOW covers whole days Saturday and Sunday.
When statistics were last gathered?
SELECT owner, table_name, last_analyzed FROM all_tables ORDER BY last_analyzed DESC NULLS LAST; --Tables.
SELECT owner, index_name, last_analyzed FROM all_indexes ORDER BY last_analyzed DESC NULLS LAST; -- Indexes.
Status of automated statistics gathering?
SELECT * FROM dba_autotask_client WHERE client_name = 'auto optimizer stats collection';
Windows Groups?
SELECT window_group_name, window_name FROM dba_scheduler_wingroup_members;
Window Schedules?
SELECT window_name, start_time, duration FROM dba_autotask_schedule;
Manually gather Database Statistics in this Schema:
EXEC dbms_stats.gather_schema_stats(ownname=>NULL, cascade=>TRUE); -- cascade=>TRUE means include Table Indexes too.
Manually gather Database Statistics in all Schemas!
-- Probably need to CONNECT / AS SYSDBA
EXEC dbms_stats.gather_database_stats;
Whenever the data changes "significantly".
If a table goes from 1 row to 200 rows, that's a significant change. When a table goes from 100,000 rows to 150,000 rows, that's not a terribly significant change. When a table goes from 1000 rows all with identical values in commonly-queried column X to 1000 rows with nearly unique values in column X, that's a significant change.
Statistics store information about item counts and relative frequencies -- things that will let it "guess" at how many rows will match a given criteria. When it guesses wrong, the optimizer can pick a very suboptimal query plan.
At my last job we ran statistics once a week. If I remember correctly, we scheduled them on a Thursday night, and on Friday the DBAs were very careful to monitor the longest running queries for anything unexpected. (Friday was picked because it was often just after a code release, and tended to be a fairly low traffic day.) When they saw a bad query they would find a better query plan and save that one so it wouldn't change again unexpectedly. (Oracle has tools to do this for you automatically, you tell it the query to optimize and it does.)
Many organizations avoid running statistics out of fear of bad query plans popping up unexpectedly. But this usually means that their query plans get worse and worse over time. And when they do run statistics then they encounter a number of problems. The resulting scramble to fix those issues confirms their fears about the dangers of running statistics. But if they ran statistics regularly, used the monitoring tools as they are supposed to, and fixed issues as they came up then they would have fewer headaches, and they wouldn't encounter them all at once.
What Oracle version are you using? Check this page which refers to Oracle 10:
http://www.acs.ilstu.edu/docs/Oracle/server.101/b10752/stats.htm
It says:
The recommended approach to gathering statistics is to allow Oracle to automatically gather the statistics. Oracle gathers statistics on all database objects automatically and maintains those statistics in a regularly-scheduled maintenance job.
When I was managing a large multi-user planning system backed by Oracle, our DBA had a weekly job that gathered statistics. Also, when we rolled out a significant change that could affect or be affected by statistics, we would force the job to run out of cycle to get things caught up.
With 10g and higher version of oracle, up to date statistics on tables and indexes are needed by the optimizer to make "good" execution plan decision. How often you collect statistics is a tricky call. It depends on your application, schema, data rate and business practice. Some third party apps which are written to be backward compatible with older version of oracle do not perform well with the new optimizer. Those application require that tables have no stats so that the db resorts back to rule base execution plan. But on the average oracle recommends that stats be collected on tables with stale statistics. You can set tables to be monitor and check their state and have them analyze if/when stale. Often that is enough, sometime it is not. It really depend on your database. For my database we have a set of OLTP tables that need nightly stats collection to maintain performance. Other tables are analyze once a week. On our large dw database, we analyze as needed as the tables are too large for regular analysis without affecting overall db load and performance. So the correct answer is, it depends on the application, data change and business needs.
Make sure to balance the risk that fresh statistics cause undesirable changes to query plans against the risk that stale statistics can themselves cause query plans to change.
Imagine you have a bug database with a table ISSUE and a column CREATE_DATE where the values in the column increase more or less monotonically. Now, assume that there is a histogram on this column that tells Oracle that the values for this column are uniformly distributed between January 1, 2008 and September 17, 2008. This makes it possible for the optimizer to reasonably estimate the number of rows that would be returned if you were looking for all issues created last week (i.e. September 7 - 13). If the application continues to be used and the statistics are never updated, though, this histogram will be less and less accurate. So the optimizer will expect queries for "issues created last week" to be less and less accurate over time and may eventually cause Oracle to change the query plan negatively.
In the case of a data warehouse-type system you can consider collecting no statistics at all, and relying on dynamic sampling (setting optimizer_dynamic_sampling to level 2 or above).
Generally it's not recommended to gather statistics so frequent on the whole database unless you have a strong justification for that, such as a bulk insert or big data change happen frequently on the database.
gathering statistics on the database in this frequency MAY change the queries execution plan to a new poor execution plans, the thing may cost you much time trying to tune every query affected by the new poor plans, this is why you should test the impact of gathering new statistics on a test database, or in case you don't have the time or the man power for that, at least you should keep a fallback plan by backing up the original statics before you gather new ones, so in case you gather a new statistics and then the queries didn't perform as expected, you can easily restore back the original statistics.
There is a very useful script can help you backup original statistics and gather new ones and provide you with SQL command you can use to restore back the original statics in case the thing didn't go as expected after gathering new statistics. You can find the script in this link:
http://dba-tips.blogspot.com/2014/09/script-to-ease-gathering-statistics-on.html
Related
I have a query that takes about 1.5 minutes to run if it runs as an insert into an empty temp table, but selecting the data doesn't complete even after 20 minutes. Why would a select take so much longer than an insert? Will SQL Server draw different plans based on that?
The estimated execution plans are the same for both, the actual execution plan is different than the estimated.
Update: It seems that the INSERT version of the query is using
parallelism, while the select version is not. My question still
remains as to why that is?
Once a single query plan reaches a certain level of complexity, I often have a better experience sequencing a collection of smaller queries together than one really big nasty query. This enables the query optimizer to evaluate a smaller request.
Locking duration in a highly transnational database can also be a significant contributing factor.
It sounds like you're on to something with the parallelism. There are several operators within SQL that will prevent parallelism.
What is a good way to handle statistics on SQL 2008 for very large databases? Multiple tables with 100m+ rows in each.
Should auto update statistics be on? Will auto update statistics async help at all? should a job be setup to manually update statistics on some kind of schedule?
Usually data is added to the table but older data isn't changed very often.
Update: About 100k rows inserted each hour. Mostly reporting is done on the data. Updates can happen on 1-2 columns on ~500k rows per day.
For one I would not want update stats to run in the middle of the day on a large table, so I would say no. Also you need to hit the threshold (20% I believe) before it kicks in anyway
Now if you have a job already that rebuilds the index then stats are updated automatically (this is not true with a reorg/defrag)
Also 100 million rows doesn't mean much, how many columns if the table is 12 bytes wide (per row) compared to 4100 bytes that is a big difference (especially since with the 4100 bytes per row table you can only fir 1 row per page)
What is a good way to handle statistics on SQL 2008 for very large databases? Multiple tables with
100m+ rows in each.
PLEASE dont call this very large. I give you an example of very large. We just run a sql statement on some data in our warehouse. Temp space usage tops at 180gb. For that statement. Db? two digits terabytes. 100m+ rows are not small, but not very large.
Should auto update statistics be on? Will auto update statistics async help at all? should a job
be setup to manually update statistics on some kind of schedule?
Depend. On update and usage patterns.
Usually data is added to the table but older data isn't changed very often.
How often? How much in percentage? What data? Do the statistics get out of scope fast or slowly move? YOu ahve to provide a LOT more information to make sensible suggestions.
Should auto update statistics be on?
It Depends...
Will auto update statistics async help
at all?
It will help prevent a stats update that takes a long time from killing a query. Basically this tells SQL Server that if a query comes in and it realizes stats are outdated instead of holding the query, updating the stats, then running the query. just run the query and update the stats behind the scenes. So that particular query that kicked off the need for a stats update won't get any benefits but it also won't sit around waiting for the stats to update first either.
should a job be setup to manually
update statistics on some kind of
schedule?
Yes! Stats are only updated if 20% of a tables data has been "changed". On very big tables that can basically be the same as saying stats will never be updated. If you have any large tables where new data is being added you should always have a scheduled process to update stats on them.
"It depends" is a good answer but in the absence of reproducable and measurable improvement I'd leave it at the default.
If you update stats manually overnight then you have less chance of an auto-update kicking in. And you can defer a stats update by setting AUTO_UPDATE_STATISTICS_ASYNC (See "When to Use Synchronous or Asynchronous Statistics Updates")
On balance I wouldn't disable it or change the default which is "on".
Setup
Cost of Threshold for Parallelism : 5
Max Degree of Parallelism : 4
Number of Processors : 8
SQL Server 2008 10.0.2.2757
I have a query with many joins, many records.
The design is a star. ( Central table with fks to the reference tables )
The central table is partitioned on the relevant date column.
The partition schema is split by days
The data is very well split across the partition schema - as judged by comparing the sizes of the files in the filegroups assigned to the partition schema
Queries involved have the predicate set over the partitioned column. such as ( cs.dte >= #min_date and cs.dte < #max_date )
The values of the date parameters are a day apart # midnight so, 2010-02-01, 2010-02-02
The estimated query plan shows no parallelism
a) This question is in regards to Sql Server 2008 Database Engine. When a query in the OLTP engine is running, I would like to see / have the sort of insight one gets when profiling an SSAS Query using Progress End event - where one sees something like "Done reading PartititionXYZ".
b) if the estimated query plan or the actual query plan shows no parallel processing does that mean that all partitions will be / were checked / read? * What I was trying to say here was - just b/c I don't see parallelism in a query plan, that doesn't guarantee the query isn't hitting multiple partitions - right? Or - is there a solid relationship between parallelism and # partitions accessed?
c) suggestions? Is there more information that I need to provide?
d) how can I tell if a query is processing in parallel w/o looking # the actual query plan? * I'm really only interested in this if it is helpful in pinning down what partitions are being used.
Added Nov 10
Try this:
Create querys that should hit 1, 3, and all your partitions
Open an SSMS query window, and run SET SHOWPLAN_XML ON
Run each query one by one in that window
Each run will kick out a chunk of XML
Compare these XML results (I use a text diff tool, “CompareIt”, but any similar tool would do)
You should see that the execution plans are significantly different. In my “3” and “All” querys, there’s a chunk of text tagged as “ConstantScan” that has an entry for (respectively) 3 and All partitions in the table, and that section is not present for the “1 partition” query. I use this to infer that yes indeed, SQL doing what it says it will do, to wit: only read as much of the table as it believes it needs to in order to resovle the query.
Got a pretty good answer here: http://www.sqlservercentral.com/Forums/Topic1064946-391-1.aspx#bm1065048
a) I am not aware of any way to determine how a query has progressed while the query is still running. Maybe something finicky with the latching and locking system views, but I doubt it. (I am, alas, not familiar enough with SSAS to draw parallels between the two.)
b) SQL will probably use parallelism when working with multiple partitions within a single table, in which case you will see parallel processing "tokens" in your query plan. However, if for whatever reason parallelism is not invoked yet multiple partitions must be read, they will be read without the use of parallelism.
d) Another thing that perhaps cannot be done. Under very controlled cirsumstances, you could use System Monitor (Perfmon) to track CPU usage or perhaps disk reads during the execution of they query. This won't help if the server is performing other work, or the data is resident in memory (the buffer cache), and so may be of limited use.
c) What is it you are actually trying to figure out? Which partitions (if any) are being accessed by users over a period of time? Is SQL generating a "smart" query plan? Without details of the data, structure, and query, it's hard to come up with advice.
We have a full text index on a fairly large table of 633,569 records. The index is rebuilt from scratch as part of a maintenance plan every evening, after a bunch of DTS packages run that delete / insert records. Large chunks of data are deleted, then inserted (to take care of updates and inserts), so incremental indexing is not a possibility. Changing the packages to only delete when necessary is not a possibility either as it is a legacy application that will eventually be replaced.
The FTI includes two columns - one a varchar(50) not null and a varchar(255) null.
There is a clustered index on the primary key column, which is just an identity column. There is also an combined index on an integer column and the varchar(50) column mentioned above. This latter index was added for performance reasons.
The problem is that the re-indexing is painfully slow - about 8 hours.
The server is fairly robust (dual processor, 4gb of ram), and everything runs quickly beyond this re-indexing.
Any tips on how to speed this up?
UPDATE
Our client has access to the sql box. Turns out they turned on change tracking on the table that is part of the full text index. We turned this off, and the full population took less than 3 hours. Still not great, but better than 8.
UPDATE 2
The FTI is again taking ~8 hours to populate.
SQL Server's indexing is slow primarily because of its asynchronous data extraction scheme.
Use change tracking with the "update
index in background" option.
The easiest way to improve the performance of full-text indexing is to use change tracking with the "update index in background" option.When you index a table (FTI, like "standard" SQL indexes, works on a per-table basis), you specify full population, incremental population, or change tracking. When you opt for full population, every row in the table you're full-text indexing is extracted and indexed. This is a two-step process.
First, you (or Enterprise Manager) run this system stored procedure:
sp_fulltext_getdata CatalogID, object_id
After all the results sets of all of the timestamps and PK values are returned to MSSearch, MSSearch will issue another sp_fulltext_getdata, but this time, once for every row in your table.So if you have 50 million rows in your database, this procedure will be issued 50 million times.
On the other hand, if you use an incremental population, MSSearch will issue an initial:
sp_fulltext_getdata CatalogID, object_id
for each row in the table that you're full-text indexing. So if you have 50 million rows in your database, this statement will also be issued 50 million times. Why? Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted. Another problem with incremental populations is that they'll index or re-index a row even if the change was made to a column that you aren't full-text indexing.
Although an incremental population is generally faster than a full population, you can see that for large tables, either will be time-consuming.
I recommend you enable change tracking with background or scheduled updating. If you do, you'll see that MSSearch will first issue another:
sp_fulltext_getdata CatalogID, object_id
for every row in the table with change tracking enabled.Then, for every row that has a column that you're full-text indexing and that's modified after your initial full population, the row information will be written (in the database you're indexing) to the sysfulltextnotify table. MSSearch will then issue the following only for the rows that apear in this table and will then remove them from the sysfulltextnotify table.
Consider using a separate build
server
Tables that are heavily updated while you're indexing can create locking problems, so if you can live with a catalog that's periodically out of date and an MSSearch engine that's sometimes unavailable consider using a separate build server. You do this by making sure the indexing server has a copy of the table to be full-text indexed and exporting the catalog .Clearly, if you need real-time or near real-time updates to your catalog, this is not a good solution
Limit activity when population is
running
When population is running, don't run Profiler, and limit other database activity as much as possible. Profiler consumes significant resources.
Increase the number of threads for
the indexing process
Increase the number of threads you're running for the indexing process. The default is only five, and on quads or 8-ways, you can bump this up to much higher values. MSSearch will, however, throttle itself if it's slurping too much data from SQL Server, so avoid doing this on single- or dual-processor systems.
Stop any anti-virus or open
file-agent backup software.
If this is not possible, try to prevent them from scanning the temporary directories being used by SQL FTI and the catalog directories
Place the catalog,temp directory and
pagefiles on their own controllers
If you can make that investment.Place the catalog on its own controller, preferably on a RAID-1 array.Place the temp directory on a RAID-1 array. Similarly, consider putting pagefile on its own RAID-1 array with its own controller.
Consider creating secondary data
files for the Temp DB - 1 per CPU /
Core.
Do you have enough RAM?
What are your file drive placements in terms of RAID configuration?
Are you seeing high tempDB activity?
(BTW, half a million records is not large; it's not even medium... ;) )
Is the system offline whilst you are doing the reindex or live ?
Are these the only items in your full text catalog; if not you might want to consider separating them out from the remainder of your FTS data. (Might help with monitoring too) In the index is the identity column configured as the unique key ?
Can you quantify the large amounts of changes? There are 3 basic options for repopulation; You might want to try switching to full or incremental as one may suit you better than the one you are using now. In my experience incremental works well if changes to the total DB are less than 40% (had a similar issue during large data take ons into the database.) If >40% change then full is likely better (from my experience - i index documents so it might work differently for you) The third option you might want to consider try the Change Tracking with scheduled update reindex option.
If you can take the server off-line to users then what performance settings do you have FTS running under whilst reindexing? You can check this Full-Text Search Service Properties / Performance tab - System Resource Usage as a slider (think there are 4 or 5 positions). There is probably a system proc to change this dont know it and dont have a 2000 machine to check anymore.
FTS / Reindexing loves ram and lots of it; the general rule of thumb is have virtual memory 3x the physical memory; if you have several physical disks then create several Pagefile.sys files, so that each Pagefile.sys file will be placed on its own physical disk. Are you on NT or Windows 2000 ? check that extended memory over 2gb is actually configured properly.
Try putting the index on a separate physical disk than the database.
EDIT: Scott reports this is already the case.
Disallowing nulls in the column that currently does might not speed up the index, but in my experience is a better practice, especially for indexing purposes. The only columns I can justify allowing nulls in are date columns.
Here is a checklist of parameters for FT-indexing performance on SQL Server. Most of them are already quoted and checked here. I don't find one of them on your comments though:
The SQL Server MAX SERVER MEMORY setting should be set manually (dynamic memory allocation is turned off) so that enough virtual memory is left for the Full-Text Search service to run. To achieve this, select a MAX SERVER MEMORY setting that once set, leaves enough virtual memory so that the Full-Text Search service is able to access an amount of virtual memory equal to 1.5 times the amount of physical RAM in the server. This will take some trial and error to achieve this setting.
Improve the Performance of Full-Text Indexes: http://msdn.microsoft.com/en-us/library/ms142560.aspx
For example, for heavily used tables with volumes in the order of 10 million rows that grow by a million rows a month, if the stats are 6-8 months old how detrimental to the performance of the database is this going to be? How often should you be refreshing the stats?
Statistics are kept and used by the query planner, and they have a noticeable impact. I can't give you exact guidelines on how often you should refresh them. That will depend on how much work it takes to refresh them and how much impact fresh stats have on your queries. The real answer for this is to take good measurements and judge options by the results. Tinkering without measurement is a throw of the dice.
We refresh stats every night. No sense waiting for the Weekend if the stats could be refreshed nightly - by Friday they will be worse than they were on Monday ...
Problem is what if it takes too long?
For databases which have that problem we refresh stats on certain tables each night - so some tables are done every night, some less often. (We have a database table of which tables to do when, and a history of how long the Stats took to regenerate, and tune the schedule accordingly)
if the stats are 6-8 months old how detrimental to the performance of the database is this going to be
I would be very surprised if it didn't make a huge difference on a table growing by 1 million rows-per-month
If that is your actual state I would expect that the tables need defragging too
Implications are dire. You should be refreshing them as often as you can to give the optimizer the best information to make decisions. You will be able to find out how bad the statistics are by running the optdiag utility. Analysing the output and running again to compare over a few days or a week will let you know exactly how bad the situation is. I would recommend that at the earliest convieniance you drop and recreate the indexes and run 'update index statistics' on the table in question. This should be enough information to get you through. I am assuming that you are able to analyse the output of optdiag though.