SQL Azure. Create Index recommendation and performance

SQL Azure. Create Index recommendation and performance - sql-server

I got several CREATE INDEX recommendations on Azure SQL S3 tier.
Before going through, I'd like to know some issues during indexing with 10-million records.
Can we know indexing progress or completion time approximately?
Does indexing work in asynchronous (or we can say lazy index) manner? Or it blocks query to the table/database?
Is there anything we need to know about performance degradation during indexing? If so, can we expect amount of degradation?
Does it perform differently from my CREAT INDEX command?
If the database is readonly-georedundant configured, I assume that index configuration itself is replicated either. But does indexing job operate separately?
If the indexing is performed on their own(replicated) database, tier master(S3 tier) to replica(S1) could have different indexing progress. is it correct?

Can we know indexing progress or completion time approximately?
You can get to know amount of space that will be used ,but not index creation time.You can track the progress though using sys.dm_exec_requests
also with SQL2016(azure compatabilty level 130) there is a new DMV called Sys.dm_exec_query_profiles..which can track accurate status better then exec requests DMV..
Does indexing work in asynchronous (or we can say lazy index) manner? Or it blocks query to the table/database?
There are two ways to create Index
1.Online
2.Offline
When you create index online,your table will not be blocked*,since SQL maintains a separate copy of index and updates both indexes parallely
with offline approach, you will experience blocking and table also won't be available
Is there anything we need to know about performance degradation during indexing? If so, can we expect amount of degradation?
You will experience additional IO load,increase in memory..This can't be accurately estimated.
Does it perform differently from my CREATE INDEX command?
Create Index is altogether a seperate statement ,i am not sure what you meant here
If the database is readonly-georedundant configured, I assume that index configuration itself is replicated either. But does indexing job operate separately?
If the indexing is performed on their own(replicated) database, tier master(S3 tier) to replica(S1) could have different indexing progress. is it correct?
Index creation is logged and all the TLOG is replayed on secondary as well.so there is no need to do index rebuilds on secondary..

Related

SQL Server: nonclustered indexes after MERGE (insert / update)

I'm new to SQL Server, trying to optimize a procedure I received from ex-colleague (and I can't ask him).
At the final step, the procedure updates a large table using a MERGE statement. After that, it drops two nonclustered indexes and creates them again. What is the purpose of doing that? Aren't collected statistics for optimizer being recollected regularly? Is recreating indexes the only way to provide optimizer with fresh statistics?
Thanks

After that, it drops two nonclustered indexes and creates them again. What is the purpose
of doing that?
To organize them into less space without possibly page splits that may have happened during the merge. Generally it is NOT needed - like at all. It MAY make sense, but it is much better to actually analyze the index statistics about page splits before doing that unless you can be sure it is beneficial on every load.
Aren't collected statistics for optimizer being recollected regularly?
They are, but reoganizing the indices may make them more efficient. As data changes, data in index pages changes and when it overflows then a page is split. This leads to the index (nto the statistics on it) being unbalanced over time, which may lead to additional IO load.
Is recreating indexes the only way to provide optimizer with fresh statistics?
No. But you do not do it for statistics in the first place. You can just update the statistics if you want this. You do it to get an efficient index.

SQL Server Indexed View Question

I have a requirement to create a report that is killing the processor and taking a long time to run.
I think I could speed this up significantly by creating an index view that keeps all this data in one place making it a lot easy to query/report on. This view would not just be used for the report as I think it would benefit quite a few areas in the data layer.
The indexed view will potentially contain 5 million+ records, I cant seem to find any guidance as to at what point indexed views are not longer recommended. I assume that an index view of this size would take considerable time to build when SQL first starts, but I would hope after this the cost of maintaining it would be minimal.
Is there any kind of best practice guide as to when to use index views and when not to use them? Would the view rebuild itself after every server restart or does it get stored somewhere on the disk?

The index associated with your Indexed View will be updated whenever updates are made to the any of the columns in the index.
High numbers of updates will most likely kill the benefit. If it is mainly reads then it will work fine.
The real benefits of Indexed Views are when you have aggregates that are too expensive to compute in real time.
Please see: Improving Performance with SQL Server 2008 Indexed Views:
Indexed views can increase query
performance in the following ways:
Aggregations can be precomputed and stored in the index to minimize
expensive computations during query
execution.
Tables can be prejoined and the resulting data set stored.
Combinations of joins or aggregations can be stored.
The query optimizer considers indexed
views only for queries with nontrivial
cost. This avoids situations where
trying to match various indexed views
during the query optimization costs
more than the savings achieved by the
indexed view usage. Indexed views are
rarely used in queries with a cost of
less than 1.
Applications that benefit from the
implementation of indexed views
include:
Decision support workloads.
Data marts.
Data warehouses.
Online analytical processing (OLAP) stores and sources.
Data mining workloads.
From the query type and pattern point
of view, the benefiting applications
can be characterized as those
containing:
Joins and aggregations of large tables.
Repeated patterns of queries.
Repeated aggregations on the same or overlapping sets of columns.
Repeated joins of the same tables on the same keys.
Combinations of the above.

An indexed view (aka materialized view) is maintained by SQL Server after every change to the underlying table(s). Needless to say, you should not have an indexed view on a table that has traffic.
For your problem, a better solution would be to run the query and store it in its own table, like:
select * into CachedReport from YourView
That will give you the performance of an indexed view, while you can decide when to refresh it. For example, you could refresh it by running the select into query from a scheduled job every night.

I'm not aware of any guidance concerning size of indexed views. It's effectively a separate table that's being "automagically" updated every time the base tables on which it depends are updated, so I tend to think of it as a separate table.
As to your question on the building of the index - it's stored on disk, the same as every other index, so it doesn't get rebuilt during server restart (other than any repair that takes place due to transactions not having completed before the restart).

There's no hard row number limit on when to use a table or a materialised view.
However as a guide line avoid a materialised view over volatile tables - the load can kill your server.
First off as Timothy suggested check the indexes on your underlying tables, then the statistics. Your Query Optimiser might be just on the complete track due to missing/out of date statistics.
If this doesn't help with performance check what data is really required from the view as my guess is that a) the row count and b) the row size is what is killing your server loading the whole view into temp table and running it through I/O contention.

MS-SQL Server 2000 slow full text indexing

We have a full text index on a fairly large table of 633,569 records. The index is rebuilt from scratch as part of a maintenance plan every evening, after a bunch of DTS packages run that delete / insert records. Large chunks of data are deleted, then inserted (to take care of updates and inserts), so incremental indexing is not a possibility. Changing the packages to only delete when necessary is not a possibility either as it is a legacy application that will eventually be replaced.
The FTI includes two columns - one a varchar(50) not null and a varchar(255) null.
There is a clustered index on the primary key column, which is just an identity column. There is also an combined index on an integer column and the varchar(50) column mentioned above. This latter index was added for performance reasons.
The problem is that the re-indexing is painfully slow - about 8 hours.
The server is fairly robust (dual processor, 4gb of ram), and everything runs quickly beyond this re-indexing.
Any tips on how to speed this up?
UPDATE
Our client has access to the sql box. Turns out they turned on change tracking on the table that is part of the full text index. We turned this off, and the full population took less than 3 hours. Still not great, but better than 8.
UPDATE 2
The FTI is again taking ~8 hours to populate.

SQL Server's indexing is slow primarily because of its asynchronous data extraction scheme.
Use change tracking with the "update
index in background" option.
The easiest way to improve the performance of full-text indexing is to use change tracking with the "update index in background" option.When you index a table (FTI, like "standard" SQL indexes, works on a per-table basis), you specify full population, incremental population, or change tracking. When you opt for full population, every row in the table you're full-text indexing is extracted and indexed. This is a two-step process.
First, you (or Enterprise Manager) run this system stored procedure:
sp_fulltext_getdata CatalogID, object_id
After all the results sets of all of the timestamps and PK values are returned to MSSearch, MSSearch will issue another sp_fulltext_getdata, but this time, once for every row in your table.So if you have 50 million rows in your database, this procedure will be issued 50 million times.
On the other hand, if you use an incremental population, MSSearch will issue an initial:
sp_fulltext_getdata CatalogID, object_id
for each row in the table that you're full-text indexing. So if you have 50 million rows in your database, this statement will also be issued 50 million times. Why? Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted. Another problem with incremental populations is that they'll index or re-index a row even if the change was made to a column that you aren't full-text indexing.
Although an incremental population is generally faster than a full population, you can see that for large tables, either will be time-consuming.
I recommend you enable change tracking with background or scheduled updating. If you do, you'll see that MSSearch will first issue another:
sp_fulltext_getdata CatalogID, object_id
for every row in the table with change tracking enabled.Then, for every row that has a column that you're full-text indexing and that's modified after your initial full population, the row information will be written (in the database you're indexing) to the sysfulltextnotify table. MSSearch will then issue the following only for the rows that apear in this table and will then remove them from the sysfulltextnotify table.
Consider using a separate build
server
Tables that are heavily updated while you're indexing can create locking problems, so if you can live with a catalog that's periodically out of date and an MSSearch engine that's sometimes unavailable consider using a separate build server. You do this by making sure the indexing server has a copy of the table to be full-text indexed and exporting the catalog .Clearly, if you need real-time or near real-time updates to your catalog, this is not a good solution
Limit activity when population is
running
When population is running, don't run Profiler, and limit other database activity as much as possible. Profiler consumes significant resources.
Increase the number of threads for
the indexing process
Increase the number of threads you're running for the indexing process. The default is only five, and on quads or 8-ways, you can bump this up to much higher values. MSSearch will, however, throttle itself if it's slurping too much data from SQL Server, so avoid doing this on single- or dual-processor systems.
Stop any anti-virus or open
file-agent backup software.
If this is not possible, try to prevent them from scanning the temporary directories being used by SQL FTI and the catalog directories
Place the catalog,temp directory and
pagefiles on their own controllers
If you can make that investment.Place the catalog on its own controller, preferably on a RAID-1 array.Place the temp directory on a RAID-1 array. Similarly, consider putting pagefile on its own RAID-1 array with its own controller.
Consider creating secondary data
files for the Temp DB - 1 per CPU /
Core.

Do you have enough RAM?
What are your file drive placements in terms of RAID configuration?
Are you seeing high tempDB activity?
(BTW, half a million records is not large; it's not even medium... ;) )

Is the system offline whilst you are doing the reindex or live ?
Are these the only items in your full text catalog; if not you might want to consider separating them out from the remainder of your FTS data. (Might help with monitoring too) In the index is the identity column configured as the unique key ?
Can you quantify the large amounts of changes? There are 3 basic options for repopulation; You might want to try switching to full or incremental as one may suit you better than the one you are using now. In my experience incremental works well if changes to the total DB are less than 40% (had a similar issue during large data take ons into the database.) If >40% change then full is likely better (from my experience - i index documents so it might work differently for you) The third option you might want to consider try the Change Tracking with scheduled update reindex option.
If you can take the server off-line to users then what performance settings do you have FTS running under whilst reindexing? You can check this Full-Text Search Service Properties / Performance tab - System Resource Usage as a slider (think there are 4 or 5 positions). There is probably a system proc to change this dont know it and dont have a 2000 machine to check anymore.
FTS / Reindexing loves ram and lots of it; the general rule of thumb is have virtual memory 3x the physical memory; if you have several physical disks then create several Pagefile.sys files, so that each Pagefile.sys file will be placed on its own physical disk. Are you on NT or Windows 2000 ? check that extended memory over 2gb is actually configured properly.

Try putting the index on a separate physical disk than the database.
EDIT: Scott reports this is already the case.

Disallowing nulls in the column that currently does might not speed up the index, but in my experience is a better practice, especially for indexing purposes. The only columns I can justify allowing nulls in are date columns.

Here is a checklist of parameters for FT-indexing performance on SQL Server. Most of them are already quoted and checked here. I don't find one of them on your comments though:
The SQL Server MAX SERVER MEMORY setting should be set manually (dynamic memory allocation is turned off) so that enough virtual memory is left for the Full-Text Search service to run. To achieve this, select a MAX SERVER MEMORY setting that once set, leaves enough virtual memory so that the Full-Text Search service is able to access an amount of virtual memory equal to 1.5 times the amount of physical RAM in the server. This will take some trial and error to achieve this setting.

Improve the Performance of Full-Text Indexes: http://msdn.microsoft.com/en-us/library/ms142560.aspx

What are some best practices and "rules of thumb" for creating database indexes?

I have an app, which cycles through a huge number of records in a database table and performs a number of SQL and .Net operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic btree indexes on a couple of the feilds, and as you would expect, the performance of the SQL operations increased substantially. Wanting to make the most of dbms performance I want to make some better educated choices about what I should index on all my projects.
I understand that there is a detrement to performance when doing inserts (as the database needs to update the index, as well as the data), but what suggestions and best practices should I consider with creating database indexes? How do I best select the feilds/combination of fields for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a btree over a hash or a gist or a gin (what are they anyway?).

Some of my rules of thumb:
Index ALL primary keys (I think most RDBMS do this when the table is created).
Index ALL foreign key columns.
Create more indexes ONLY if:
Queries are slow.
You know the data volume is going to increase significantly.
Run statistics when populating a lot of data in tables.
If a query is slow, look at the execution plan and:
If the query for a table only uses a few columns, put all those columns into an index, then you can help the RDBMS to only use the index.
Don't waste resources indexing tiny tables (hundreds of records).
Index multiple columns in order from high cardinality to less. This means: first index the columns with more distinct values, followed by columns with fewer distinct values.
If a query needs to access more than 10% of the data, a full scan is normally better than an index.

Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index I'v e always felt that the designer ought to know the business well enough to be able to take a very good first pass at which columns are likely to benefit. Other then that it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal metrics on performance to capture long-running operations),

As #David Aldridge mentioned, the majority of databases perform many more reads than they do writes and in addition, appropriate indexes will often be utilised even when performing INSERTS (to determine the correct place to INSERT).
The critical indexes under an unknown production workload are often hard to guess/estimate, and a set of indexes should not be viewed as set once and forget. Indexes should be monitored and altered with changing workloads (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, then I will create indexes on all Foriegn Keys, profile under a workload (think UAT release) and remove those that are not being used, as well as creating important missing indexes.
Also, make sure that a scheduled index maintenance plan is also created.

How many database indexes is too many?

I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?

It depends on the operations that occur on the table.
If there's lots of SELECTs and very few changes, index all you like.... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs + DELETEs ... these will be very slow with lots of indexes since they all need to be modified each time one of these operations takes place
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.

I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with all any optimization, I stop when the requested performance is reached (this obviously implies that point 0. would be getting specific performance requirements).

Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't end up getting used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:-
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.

In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.

In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.

In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.

I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real
project and real MySql database.
My results are: adding average index
(1-3 columns in an index) to a table -
makes inserts slower by 2.1%. So, if
you add 20 indexes, your inserts will
be slower by 40-50%. But your selects
will be 10-100 times faster.
So is it ok to add many indexes? - It
depends :) I gave you my results - You
decide!

Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general the more inserting you do the more painful your indexes become. Each time you do an insert, all the indexes that include that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.

There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the oposite in that queries are done over unristricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reogranizing your DB to include calculated search fields, splitting tables, etc - it's really dependant on your load shapes and query parameters, how much/what data 'really' needs to be retruend by a query.
If your entire DB is fronted by stored-procedure facades turning becomes a bit easier, as you don't have to wory about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning advisor usefull - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'offical' or third party.

This really is a more theoretical questions than practical. Indexes impact on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced a dedicated storage, made by HP, which is supposed to perform 10 times faster with 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to IOT (index organized table), if your data support this.
Of course there might be many more solutions for such case. My first suggestion to you, would be to clone the DB to a development environment, and run some stress testing against it.

An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to spped up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.

If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.

One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.

How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the amount of columns, IMHO.

What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.

Sql server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.

It is totally based on the columns which are being used in Where Clause.
And as the Thumb of Rule, we must have indexes on Foreign Key Columns to avoid DEADLOCKS.
AWR report should analyze periodically to understand the need of indexes.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight