I have 2 tables (UserLog and UserInfo), each with a nonclustered index on the User_UID column, which is a unique identifier.
I have a lot of select queries that join these 2 tables on the User_UID column.
There is no clustered index on these tables, so to improve read performance I decided to create a new column, User_ID, and then create a clustered index on this column in each table.
I then tested the new architecture and obtained great results: logical reads decreased on both tables, because the query optimiser no longer uses a RID lookup to retrieve the remaining columns. Instead it uses only a clustered index seek.
I obtained these good results only when the pages are already in the memory cache, i.e. after 2 executions. However, if I clean the cache (dbcc dropcleanbuffers), the first execution of the select query still shows fewer logical reads, but its elapsed time is greater than when I execute the same query with the old architecture (without the clustered index) just after cleaning the cache.
So my question is: why does the elapsed time with the new architecture increase after cleaning the cache? Is it because on the first execution all the data has to be read into the memory cache, and since the clustered index holds more data than the nonclustered index, that takes more time?
Thanks in advance
Regardless, you should have a clustered index on your table. If you don't, you have a heap, and any access that a nonclustered index cannot satisfy on its own has to scan the table or go through RID lookups. With a clustered index, your table is sorted into a b-tree that is used for navigation to the leaf level, which is more efficient.
By blowing out the buffer, whether you have a seek on a clustered index or a scan on a heap, the pages have to be pulled from disk, and that takes time.
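To make the logical-read difference from the question concrete, here is a toy cost model in Python. It is not how SQL Server costs a plan; the function names, the index depth and the rows-per-page figure are all made up for illustration, and the heap case assumes the worst case where every RID lookup lands on a different page.

```python
import math

def reads_with_rid_lookups(matching_rows, index_depth):
    """Nonclustered index seek on a heap: walk the index once, then follow
    one RID per matching row (assumed worst case: each row on its own page)."""
    return index_depth + matching_rows

def reads_with_clustered_seek(matching_rows, index_depth, rows_per_leaf_page):
    """Clustered index seek: walk the index once down to the first leaf page,
    then read any further leaf pages the contiguous range spills onto."""
    leaf_pages = math.ceil(matching_rows / rows_per_leaf_page)
    return index_depth + max(leaf_pages - 1, 0)

# Hypothetical numbers: 100 matching rows, a 3-level index, 50 rows per leaf page.
heap_reads = reads_with_rid_lookups(matching_rows=100, index_depth=3)
clustered_reads = reads_with_clustered_seek(
    matching_rows=100, index_depth=3, rows_per_leaf_page=50
)
```

With a cold cache each of those logical reads may turn into a physical read, so the first run is dominated by how many pages come off disk and how they are laid out, not only by the logical read count.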
I have a huge database, around 1 TB in size. Most of the space is consumed by a table that stores images; the table currently has almost 800k rows.
Server response time has increased. Which techniques should I use, or would you recommend? Partitioning? Or how should I reorganize the table?
Every row is accessed by the image id column, which is the clustered index key. Every two days I reorganize the index and every 7 days I rebuild it, but it does not seem to be working.
Any suggestions?
If the table is clustered by image_id and you access always by image_id then the size of the table is irrelevant, and so is the fragmentation (no need to rebuild).
If you see performance decrease, then there must be something else at play. Are you doing range scans? Look in sys.dm_db_index_usage_stats: does the user_scans column differ from 0? If so, you have queries that do scans.
Unless you measure where the time increase occurs, you'll be shooting in the dark and never solve the problem correctly. Apply a methodical approach, like Waits and Queues, to identify the problem.
One thing I can tell you right now: partitioning is never a performance improvement. It is intended for data maintenance (switch in/switch out) and for spreading the load in a controlled fashion across filegroups. But you can never expect partitioning to improve performance; you can at best hope for performance equal to that of the non-partitioned table.
If the response time is increasing, you must be doing more with this table than just pulling images for ids?
What other data columns are stored in your images table?
If you have a clustered index on an id (probably identity), that's fine, but adding an additional nonclustered index which can be covering for search criteria will probably help.
Say you also have columns for name or tag or region or whatever in this images table (and assuming you aren't going to vertically partition this table into separate tables). Then a nonclustered index on (tag, id) INCLUDE (name), say, or something else that matches your usage patterns, will help a lot.
Remember: A clustered index is not an index, it's just the way the data is organized. It will usually not help much in any kind of search operations - it primarily works well on identity lookups, when you are reading almost every column, and streaming data in the order of the clustered index.
I have read that one of the tradeoffs for adding table indexes in SQL Server is the increased cost of insert/update/delete queries to benefit the performance of select queries.
I can conceptually understand what happens in the case of an insert because SQL Server has to write entries into each index matching the new rows, but update and delete are a little more murky to me because I can't quite wrap my head around what the database engine has to do.
Let's take DELETE as an example and assume I have the following schema:
CREATE TABLE Foo (
    col1 int NOT NULL
    ,col2 int NOT NULL
    ,col3 int
    ,col4 int
    ,PRIMARY KEY (col1, col2)
);
CREATE INDEX IX_1 ON Foo (col3) INCLUDE (col4);
Now, if I issue the statement
DELETE FROM Foo WHERE col1=12 AND col2 > 34
I understand what the engine must do to update the table (or clustered index if you prefer). The index is set up to make it easy to find the range of rows to be removed and do so.
However, at this point it also needs to update IX_1 and the query that I gave it gives no obvious efficient way for the database engine to find the rows to update. Is it forced to do a full index scan at this point? Does the engine read the rows from the clustered index first and generate a smarter internal delete against the index?
It might help me to wrap my head around this if I understood better what is going on under the hood, but I guess my real question is this. I have a database that is spending a significant amount of time in delete and I'm trying to figure out what I can do about it.
When I display the execution plan for the deletion, it just shows an entry for "Clustered Index Delete" on table Foo which lists in the details section the other indices that need to be updated but I don't get any indication of the relative cost of these other indices.
Are they all equal in this case? Is there some way that I can estimate the impact of removing one or more of these indices without having to actually try it?
Nonclustered indexes also store the clustered keys.
It does not have to do a full scan, since:
your query will use the clustered index to locate rows
rows contain the other index value (c3)
using the other index value (c3) and the clustered index values (c1,c2), it can locate matching entries in the other index.
(Note: I had trouble interpreting the docs, but I would imagine that IX_1 in your case could be defined as if it was also sorted on c1,c2. Since these are already stored in the index, it would make perfect sense to use them to more efficiently locate records for e.g. updates and deletes.)
All this, however has a cost. For each matching row:
it has to read the row, to find out the value for c3
it has to find the entry for (c3,c1,c2) in the nonclustered index
it has to delete the entry from there as well.
Furthermore, while the range query can be efficient on the clustered index in your case (linear access, after finding a match), maintenance of the other indexes will most likely result in random access to them for every matching row. Random access has a much higher cost than just enumerating B+ tree leaf nodes starting from a given match.
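The steps above can be sketched in Python. This is only a model of the idea, not SQL Server's implementation: the clustered index is a list sorted by (col1, col2), IX_1 is a list sorted by (col3, col1, col2) with col4 carried along, and the sample rows and the delete_range helper are invented for illustration.

```python
import bisect

# Clustered index: rows (col1, col2, col3, col4) kept sorted by (col1, col2).
clustered = sorted([
    (12, 10, 7, 100),
    (12, 35, 3, 101),
    (12, 40, 9, 102),
    (12, 50, 3, 103),
    (13, 1, 5, 104),
])
# IX_1: entries (col3, col1, col2, col4), sorted by col3 and then the clustering key.
ix_1 = sorted((c3, c1, c2, c4) for (c1, c2, c3, c4) in clustered)

def delete_range(col1, col2_greater_than):
    """Model of DELETE FROM Foo WHERE col1 = ? AND col2 > ?; returns rows deleted."""
    keys = [row[:2] for row in clustered]
    # One seek finds a contiguous range in the clustered index.
    lo = bisect.bisect_right(keys, (col1, col2_greater_than))
    hi = bisect.bisect_right(keys, (col1, float("inf")))
    doomed = clustered[lo:hi]
    del clustered[lo:hi]
    for (c1, c2, c3, _c4) in doomed:
        # One separate seek into IX_1 per deleted row, keyed by (col3, col1, col2).
        pos = bisect.bisect_left(ix_1, (c3, c1, c2))
        assert ix_1[pos][:3] == (c3, c1, c2)
        del ix_1[pos]
    return len(doomed)

deleted = delete_range(12, 34)
```

The clustered delete touches one contiguous slice, while the three IX_1 entries it drags along sit at unrelated positions (col3 values 3, 9 and 3), which is the random access described above.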
Given the above query, more time is spent on the non-clustered index maintenance; the amount depends heavily on the number of records selected by the col1 = 12 AND col2 > 34 predicate.
My guess is that the cost is conceptually the same as if you did not have a secondary index but had e.g. a separate table, holding (c3,c1,c2) as the only columns in a clustered key and you did a DELETE for each matching row using (c3,c1,c2). Obviously, index maintenance is internal to SQL Server and is faster, but conceptually, I guess the above is close.
The above would mean that maintenance costs of indexes would stay pretty close to each other, since the number of entries in each secondary index is the same (the number of records) and deletion can proceed only one-by-one on each index.
If you need the indexes for performance, then depending on the number of deleted records you might be better off scheduling the deletes: drop the indexes that are not used during the delete beforehand and add them back afterwards. Depending on the number of records affected, rebuilding the indexes might be faster.
If I have a table column with data and create an index on this column, will the index take the same amount of disc space as the column itself?
I'm interested because I'm trying to understand whether b-trees actually keep copies of column data in leaf nodes or somehow point to it.
Sorry if this is a "Will Java replace XML?" kind of question.
UPDATE:
created a table without index with a single GUID column, added 1M rows - 26MB
same table with a primary key (clustered index) - 25MB (even less!), index size - 176KB
same table with a unique key (nonclustered index) - 26MB, index size - 27MB
So only nonclustered indexes take as much space as the data itself.
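A back-of-the-envelope estimate in Python lands close to these measurements. The per-row overheads below are my assumptions about SQL Server's on-page format (7 bytes of row overhead plus a 2-byte slot entry for the heap row; a 1-byte header, an 8-byte row locator and a 2-byte slot entry for the index row), not figures taken from the measurements above, and upper index levels are ignored.

```python
import math

PAGE_BYTES = 8192          # SQL Server page size
PAGE_PAYLOAD_BYTES = 8096  # assumed: page size minus a 96-byte page header

def estimated_pages(row_count, bytes_per_row):
    """Pages needed if every page is packed with as many whole rows as fit."""
    rows_per_page = PAGE_PAYLOAD_BYTES // bytes_per_row
    return math.ceil(row_count / rows_per_page)

def estimated_megabytes(row_count, bytes_per_row):
    return estimated_pages(row_count, bytes_per_row) * PAGE_BYTES / 1_000_000

GUID_BYTES = 16
heap_row_bytes = GUID_BYTES + 7 + 2        # assumed row overhead + slot entry
index_row_bytes = GUID_BYTES + 8 + 1 + 2   # assumed row locator + header + slot entry

heap_mb = estimated_megabytes(1_000_000, heap_row_bytes)
index_mb = estimated_megabytes(1_000_000, index_row_bytes)
```

Under those assumptions the table comes out at roughly 25 MB and the nonclustered index at roughly 27 MB. For the clustered case the leaf level is the data itself, so the small "index size" would cover only the levels above the leaves.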
All measurements were done in SQL Server 2005
The B-Tree points to the row in the table, but the B-Tree itself still takes some space on disk.
Some databases have a special kind of table that embeds the main index and the data. In Oracle, it's called an IOT (index-organized table).
Each row in a regular table can be identified by an internal ID (but it's database specific) which is used by the B-Tree to identify the row. In Oracle, it's called rowid and looks like AAAAECAABAAAAgiAAA :)
"If I have a table column with data and create an index on this column, will the index take same amount of disc space as the column itself?"
In a basic B-Tree, you have the same number of nodes as the number of items in the column.
Consider 1,2,3,4:
1
/
2
\ 3
\ 4
The exact space can still be a bit different (the index is probably a bit bigger, as it needs to store links between nodes, it may not be perfectly balanced, etc.), and I guess a database can use optimizations to compress part of the index. But the index and the column data should be of the same order of magnitude.
I'm almost sure it's quite DB-dependent, but generally, yes, they take additional space. This happens for two reasons:
This way you can utilize the fact that the data in the BTREE leaves is sorted;
You gain a lookup speed advantage, as you don't have to seek back and forth to fetch the necessary data.
PS: just checked our MySQL server: for a 20GB table the indexes take 10GB of space :)
Judging by this article, it will, in fact, take at least the same amount of space as the data in the column (in PostgreSQL, anyway).
The article also goes on to suggest a strategy to reduce disk and memory usage.
A way to check for yourself would be to use e.g. the Derby DB: create a table with a million rows and a single column, check its size, create an index on the column and check its size again. If you take the 10-15 minutes to do so, let us know the results. :)
Is it true that an UPDATE query is slow because of a clustered index?
You would be better off saying 'slower' rather than 'slow'. When data is written to a clustered index, and it doesn't go at the very end of the table, data needs to be joggled around in order to fit it in, in the same way that adding a CD into a big stack of alphabetised CDs is a lot slower than just sticking it on the top.
If you don't have any clustered indexes at all, then what you have is termed a "heap". You also have a heap of trouble, since the order of the data in your table is random - and selecting data from the table will be slow. That may be OK if you're doing many more INSERTs than you are SELECTs, but usually that's not the case.
Whether the clustered index makes INSERTs slower or not depends on:
The fill factor of the table (i.e. whether there are enough gaps in the data to allow new data to be inserted without moving everything around).
What columns are chosen as the cluster key.
If you're using an identity column as the cluster key, then you may find that insert performance is perfectly fine, since new entries are always being added on the end. The same may apply to a datetime column if using the current date (which of course also keeps increasing).
You need to keep the size of the cluster key small, since that's the index into the data that's stored in every other index. For example, if your cluster key consists of 3 ints and a datetime, then each entry in all your other indexes will include all that data in addition to whatever it was that you tried to index. For this reason, an identity column is actually a pretty good choice of cluster key since it's nice & small.
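The arithmetic behind that advice is simple enough to sketch in Python. The row count and the number of nonclustered indexes are hypothetical, and the byte sizes assume SQL Server's 4-byte int and 8-byte datetime; the helper name is made up.

```python
INT_BYTES = 4
DATETIME_BYTES = 8

def clustering_key_overhead_bytes(key_bytes, row_count, nonclustered_index_count):
    """Bytes spent repeating the cluster key in every entry of every other index."""
    return key_bytes * row_count * nonclustered_index_count

wide_key_bytes = 3 * INT_BYTES + DATETIME_BYTES   # 3 ints and a datetime
narrow_key_bytes = INT_BYTES                      # a single int identity column

# Hypothetical table: 10 million rows and 5 nonclustered indexes.
wide_overhead = clustering_key_overhead_bytes(wide_key_bytes, 10_000_000, 5)
narrow_overhead = clustering_key_overhead_bytes(narrow_key_bytes, 10_000_000, 5)
```

In this made-up case the wide key costs about 1 GB of repeated key data across the other indexes against about 200 MB for the identity column, before counting whatever those indexes were actually built to hold.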
The perfect cluster key in any situation can only be chosen with a good deal of thought and a lot of testing (with realistically large data sets). Having a good cluster key can make a huge difference to SELECT performance - which normally outweighs any degradation in INSERT performance.
Define slow. Of course, the clustered index will always be slower than a non-clustered index...
Insertions and updates are slower because of clustered indexes (particularly on huge tables), but selects are way faster.
Making the index non-clustered usually improves insert and update performance while retaining most of the select performance (selects are often less performant with a non-clustered index than with a clustered index, but something's gotta give).
A clustered index dictates how a table is physically stored on disk, and so updating a table with a clustered index may require that significant parts of the table be moved to make space for the new record, and that's slow.
You can mitigate the problem by setting an appropriate fillfactor for your indexes. It's not quite so bad that you have to re-jigger the whole table when you add a record to the middle; it's usually just a few pages. Fillfactor determines how much of each page is filled before creating a new page, and how much to leave as wiggle room for new insertions. A lower fillfactor on an index will leave more space for new records and therefore give faster insert times on average, at the cost of more disk space and more pages and therefore slower reads. But if you're doing a lot more updating than reading it may be worth it.
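A small Python simulation shows the effect of fillfactor on page splits. It is a deliberately crude model (pages hold at most 10 keys and an overfull page splits in half), and the function names, page capacity and key pattern are my own choices, not anything SQL Server exposes.

```python
import bisect

def build_pages(sorted_keys, capacity, fill_factor):
    """Pack sorted keys into leaf pages, filling each to fill_factor percent."""
    per_page = max(1, capacity * fill_factor // 100)
    return [sorted_keys[i:i + per_page] for i in range(0, len(sorted_keys), per_page)]

def insert_key(pages, key, capacity):
    """Insert key into the page covering it; return 1 if that page had to split."""
    first_keys = [page[0] for page in pages]
    idx = max(bisect.bisect_right(first_keys, key) - 1, 0)
    page = pages[idx]
    bisect.insort(page, key)
    if len(page) <= capacity:
        return 0
    mid = len(page) // 2
    pages[idx:idx + 1] = [page[:mid], page[mid:]]  # split the overfull page in two
    return 1

def count_splits(fill_factor, capacity=10):
    existing = list(range(0, 1000, 2))   # 500 even keys already in the index
    new_keys = range(1, 1000, 10)        # 100 new keys landing between them
    pages = build_pages(existing, capacity, fill_factor)
    return sum(insert_key(pages, key, capacity) for key in new_keys)

splits_full = count_splits(fill_factor=100)
splits_padded = count_splits(fill_factor=70)
```

In this model the fully packed index splits 50 times while the one built at 70% absorbs all 100 inserts without a split. The price is that it starts out with 72 pages instead of 50, which is the extra reading mentioned above.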
What are the differences between a clustered and a non-clustered index?
Clustered Index
Only one per table
Faster to read than non-clustered, as data is physically stored in index order
Non Clustered Index
Can be used many times per table
Quicker for insert and update operations than a clustered index
Both types of index will improve performance when selecting data by fields that use the index, but will slow down update and insert operations.
Because of the slower inserts and updates, clustered indexes should be set on a field that is normally incremental, i.e. an Id or Timestamp.
SQL Server will normally only use a nonclustered index if it is highly selective, i.e. the predicate matches only a small fraction of the rows (the figure usually quoted is a selectivity above 95%).
Clustered indexes physically order the data on the disk. This means no extra data is needed for the index, but there can be only one clustered index (obviously). Accessing data using a clustered index is fastest.
All other indexes must be non-clustered. A non-clustered index has a duplicate of the data from the indexed columns kept ordered together with pointers to the actual data rows (pointers to the clustered index if there is one). This means that accessing data through a non-clustered index has to go through an extra layer of indirection. However if you select only the data that's available in the indexed columns you can get the data back directly from the duplicated index data (that's why it's a good idea to SELECT only the columns that you need and not use *)
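Here is a minimal Python sketch of that extra layer of indirection and of how selecting only indexed columns avoids it. The table, the column names and the select_by_tag helper are hypothetical; the nonclustered index is modelled as a sorted list of (tag, id) pairs and the clustered table as a dict keyed by id.

```python
import bisect

rows = [
    {"id": 1, "tag": "cat", "name": "a.png", "size_kb": 120},
    {"id": 2, "tag": "dog", "name": "b.png", "size_kb": 340},
    {"id": 3, "tag": "cat", "name": "c.png", "size_kb": 95},
]
clustered = {row["id"]: row for row in rows}               # table stored by id
ix_tag = sorted((row["tag"], row["id"]) for row in rows)   # nonclustered index on tag

def select_by_tag(tag, columns):
    """Return (result rows, number of lookups into the clustered table)."""
    lo = bisect.bisect_left(ix_tag, (tag,))
    hi = bisect.bisect_right(ix_tag, (tag, float("inf")))
    lookups = 0
    result = []
    for tag_value, row_id in ix_tag[lo:hi]:
        entry = {"tag": tag_value, "id": row_id}
        if not all(column in entry for column in columns):
            entry = clustered[row_id]   # extra hop: the index does not cover the query
            lookups += 1
        result.append(tuple(entry[column] for column in columns))
    return result, lookups

covered = select_by_tag("cat", ["id"])             # answered from the index alone
uncovered = select_by_tag("cat", ["id", "name"])   # needs the base rows as well
```

The first call never touches the table, while the second pays one lookup per matching row, which is what a SELECT * forces on every query.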
Clustered indexes are stored physically on the table. This means they are the fastest and you can only have one clustered index per table.
Non-clustered indexes are stored separately, and you can have as many as you want.
The best option is to set your clustered index on the most used unique column, usually the PK. You should always have a well selected clustered index in your tables, unless a very compelling reason--can't think of a single one, but hey, it may be out there--for not doing so comes up.
Clustered Index
There can be only one clustered index for a table.
Usually made on the primary key.
The leaf nodes of a clustered index contain the data pages.
Non-Clustered Index
There can be only 249 non-clustered indexes for a table (up to SQL Server 2005; later versions support up to 999 non-clustered indexes).
Can be made on any key.
The leaf node of a nonclustered index does not consist of the data pages. Instead, the leaf nodes contain index rows.
Clustered Index
There can be only one clustered index in a table
Sort the records and store them physically according to the order
Data retrieval is faster than non-clustered indexes
Do not need extra space to store logical structure
Non Clustered Index
There can be any number of non-clustered indexes in a table
Do not affect the physical order. Create a logical order for data rows and use pointers to physical data files
Data insertion/update is faster than clustered index
Use extra space to store logical structure
Apart from these differences, you should know that when a table is non-clustered (i.e. it doesn't have a clustered index), its data rows are unordered and the table is stored as a heap.
Pros:
Clustered indexes work great for ranges (e.g. select * from my_table where my_key between @min and @max)
In some conditions, the DBMS will not have to do any work to sort if you use an ORDER BY clause.
Cons:
Clustered indexes can slow down inserts, because the physical layout of the records has to be modified as records are put in if the new keys are not in sequential order.
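The range case is easy to picture with a sorted list in Python, which is roughly what the leaf level of a clustered index looks like. This is only an analogy; the key values and the range_scan helper are made up.

```python
import bisect

# Rows kept in my_key order, as a clustered index would store them.
clustered_keys = [3, 8, 15, 16, 23, 42, 57, 61]

def range_scan(min_key, max_key):
    """Model of WHERE my_key BETWEEN min_key AND max_key (both ends inclusive)."""
    lo = bisect.bisect_left(clustered_keys, min_key)
    hi = bisect.bisect_right(clustered_keys, max_key)
    return clustered_keys[lo:hi]   # one contiguous slice, already in key order

in_range = range_scan(10, 50)
```

Two seeks find the ends of the range and everything in between is read sequentially. The result also comes back in key order, which is why an ORDER BY on the cluster key can come for free.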
Clustered basically means that the data is in that physical order in the table. This is why you can have only one per table.
Unclustered means it's "only" a logical order.
A clustered index actually describes the order in which records are physically stored on the disk, hence the reason you can only have one.
A Non-Clustered Index defines a logical order that does not match the physical order on disk.
An indexed database has two parts: a set of physical records, which are arranged in some arbitrary order, and a set of indexes which identify the sequence in which records should be read to yield a result sorted by some criterion. If there is no correlation between the physical arrangement and the index, then reading out all the records in order may require making lots of independent single-record read operations. Because a database may be able to read dozens of consecutive records in less time than it would take to read two non-consecutive records, performance may be improved if records which are consecutive in the index are also stored consecutively on disk. Specifying that an index is clustered will cause the database to make some effort (different databases differ as to how much) to arrange things so that groups of records which are consecutive in the index will be consecutive on disk.
For example, if one were to start with an empty non-clustered database and add 10,000 records in random sequence, the records would likely be added at the end in the order they were added. Reading out the database in order by the index would require 10,000 one-record reads. If one were to use a clustered database, however, the system might check when adding each record whether the previous record was stored by itself; if it found that to be the case, it might write that record with the new one at the end of the database. It could then look at the physical record before the slots where the moved records used to reside and see if the record that followed that was stored by itself. If it found that to be the case, it could move that record to that spot. Using this sort of approach would cause many records to be grouped together in pairs, thus potentially nearly doubling sequential read speed.
In reality, clustered databases use more sophisticated algorithms than this. A key thing to note, though, is that there is a tradeoff between the time required to update the database and the time required to read it sequentially. Maintaining a clustered database will significantly increase the amount of work required to add, remove, or update records in any way that would affect the sorting sequence. If the database will be read sequentially much more often than it will be updated, clustering can be a big win. If it will be updated often but seldom read out in sequence, clustering can be a big performance drain, especially if the sequence in which items are added to the database is independent of their sort order with regard to the clustered index.
A clustered index is essentially the table's data itself, kept sorted by the indexed columns.
The main advantage of a clustered index is that when your query (seek) locates the data in the index then no additional IO is needed to retrieve that data.
The overhead of maintaining a clustered index, especially in a frequently updated table, can lead to poor performance and for that reason it may be preferable to create a non-clustered index.
You might have gone through the theory in the posts above:
- The clustered index, as we can see, points directly to the record, i.e. it is direct, so a search takes less time. Additionally, it does not take any extra memory/space to store the index.
- In a non-clustered index, the entry points indirectly to the clustered index, which then accesses the actual record; because of this indirection it takes somewhat more time to access. It also needs its own memory/space to store the index.
// Copied from MSDN; the second point about non-clustered indexes is not clearly mentioned in the other answers.
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the
index definition. There can be only one clustered index per table,
because the data rows themselves can be stored in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table
has no clustered index, its data rows are stored in an unordered
structure called a heap.
Nonclustered
Nonclustered indexes have a structure separate from the data rows. A
nonclustered index contains the nonclustered index key values and
each key value entry has a pointer to the data row that contains the
key value.
The pointer from an index row in a nonclustered index to a data row
is called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table.
For a heap, a row locator is a pointer to the row. For a clustered
table, the row locator is the clustered index key.
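The two kinds of row locator can be pictured with a couple of Python dicts. This is an analogy only: the (page, slot) pairs, the person_id key and the sample rows are invented.

```python
# Heap: rows sit wherever they landed, so the locator is a physical (page, slot) pair.
heap = {(0, 0): ("bob", 30), (0, 1): ("alice", 25), (1, 0): ("carol", 41)}
ix_name_on_heap = {"alice": (0, 1), "bob": (0, 0), "carol": (1, 0)}

# Clustered table: rows are stored by a hypothetical person_id key,
# so the locator kept in the nonclustered index is that key value.
clustered = {101: ("alice", 25), 102: ("bob", 30), 103: ("carol", 41)}
ix_name_on_clustered = {"alice": 101, "bob": 102, "carol": 103}

def find_on_heap(name):
    page_and_slot = ix_name_on_heap[name]      # row locator = pointer to the row
    return heap[page_and_slot]

def find_on_clustered(name):
    clustering_key = ix_name_on_clustered[name]  # row locator = clustered index key
    return clustered[clustering_key]             # second seek, through the clustered index
```

Both lookups return the same row. The difference is that the heap locator names a physical position, while the clustered locator is a key that has to be looked up again.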
Clustered Indexes
Clustered Indexes are faster for retrieval and slower for insertion and update.
A table can have only one clustered index.
Don't require extra space to store a logical structure.
Determine the order in which the data is stored on disk.
Non-Clustered Indexes
Non-clustered indexes are slower for retrieving data and faster for insertion and update.
A table can have multiple non-clustered indexes.
Require extra space to store a logical structure.
Have no effect on the order in which data is stored on disk.