What happens to a clustered index when a PK is created on two columns in SQL Server

I just created a table with a primary key on TWO columns in SQL Server. One column is age, the other is an ID number, and I set the option for a clustered index, so a clustered index is automatically created on both columns. However, when I query the table, the results only seem to be sorted by ID and completely ignore AGE (the other primary key and clustered index column). Why is this? Why is it only sorting based on the first clustered index column?

The query optimizer may decide to use the physical ordering of the rows in the table if there is no advantage in ordering any other way. So, when you select from the table using a simple query, it may be ordered this way. It is very easy to assume that the rows are physically stored in the order specified within the definition of your clustered index. But this turns out to be a false assumption.
Please view the following article for more details: Clustered Index do “NOT” guarantee Physically Ordering or Sorting of Rows
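Separately from physical storage, it may help to see how a two-column key orders rows at all: the first column is compared first, and the second only breaks ties. Here is a minimal Python sketch, assuming the key is (ID, age) and that every ID is distinct. The rows and the column order are made up for illustration, and this models only the key comparison, not what SQL Server returns without an ORDER BY.

```python
# Hypothetical rows as (id_number, age); the key order (ID, age) is an assumption.
rows = [(3, 25), (1, 40), (2, 31), (4, 19)]

# A composite key compares the first column, then the second only on ties.
by_composite_key = sorted(rows, key=lambda r: (r[0], r[1]))
by_id_only = sorted(rows, key=lambda r: r[0])

# Because every ID is distinct, age never gets a chance to break a tie.
print(by_composite_key == by_id_only)

# With duplicate IDs the second column becomes visible in the ordering.
dupes = [(1, 40), (1, 22), (2, 31)]
print(sorted(dupes))
```

Either way, if you need the rows back in a particular order, add an ORDER BY to the query.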

Related

Querying a High Cardinality Field

I am designing a data model for our orders ahead of our upcoming Cassandra migration. An order has an orderId (an arcane UUID field) and an orderNumber (a user-friendly number). A getOrder query can use either of the two.
My partition key is the orderId, so getByOrderId is not a problem. But getByOrderNumber is: there's a one-to-one mapping between the orderId and the orderNumber (a high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering is whether I could create a new table with the orderNumber as the partition key and the orderId as the only column (a kind of secondary index, but maintained by me). A getByOrderNumber query could then be resolved in two calls.
Bear with me if the above solution is egregiously wrong; I am extremely new to Cassandra. As I understand it, for such a column, if I used local secondary indexes, Cassandra would have to query each node for a single order. So I thought, why not create another table that stores the mapping?
What would I be missing out on by managing this index myself? One thing I can see is that for every write, I'll now have to update two tables. Anything else?
Regarding "I thought why not create another table that stores the mapping": that's okay. From the Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of records for a small number of results. See Problems using a high-cardinality column index below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many distinct values, a query between the fields incurs many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient.
It would probably be more efficient to manually maintain the table as a form of an index instead of using the built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
Denormalized data is normal in Cassandra data modelling.
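To make the two-call lookup concrete, here is a rough Python sketch in which two in-memory dicts stand in for the two Cassandra tables. The names (orders_by_id, order_id_by_number, save_order and so on) are made up for illustration and nothing here uses a real Cassandra driver.

```python
import uuid

# Two in-memory dicts stand in for the two Cassandra tables (illustrative only).
orders_by_id = {}         # partition key: order_id -> full order
order_id_by_number = {}   # partition key: order_number -> order_id


def save_order(order_number, payload):
    """Write the order and its lookup entry; every write touches both tables."""
    order_id = uuid.uuid4()
    orders_by_id[order_id] = {"order_number": order_number, **payload}
    order_id_by_number[order_number] = order_id
    return order_id


def get_by_order_id(order_id):
    """Single read against the main table."""
    return orders_by_id.get(order_id)


def get_by_order_number(order_number):
    """Two single-partition reads: number -> id, then id -> order."""
    order_id = order_id_by_number.get(order_number)
    return None if order_id is None else orders_by_id.get(order_id)
```

The sketch ignores the case where one of the two writes succeeds and the other fails. Handling that is the main extra responsibility you take on when you maintain the index table yourself.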

What's the difference between table scanning a clustered table and index scanning?

THE SITUATION
I have a table with only one index, a clustered index (two columns).
I do a 'SELECT * FROM TABLE' and the optimizer chooses a table scan.
I get the rows kind of sorted by the clustered index. I say kind of because the order doesn't look random, but it has a lot of glitches.
If I force use of the clustered index with SELECT * FROM TABLE (index 1 MRU), I get exactly the clustered table order.
QUESTIONS
How can the table scan result be in a different order than the clustered index scan if the data in a clustered table is sorted by its index?
Is the table scan on a clustered table a scan of the leaf level of the table? Aren't those pages sorted?
Is the clustered index scan a scan of all the possible paths of the B-tree in an ordered manner?
Excuse my possible lack of knowledge; I'm trying my best to understand the underlying concepts.
HOW I TESTED THIS
I got these inconsistent ordering results by testing two different clustered indexes (one with two columns and the other with one column), creating and dropping the constraint and checking the select statement each time.
After truncating the table and creating the index, the data is correctly sorted, but after dropping the index and creating a different one, the data is not perfectly sorted with a table scan. I need to force index use.
WHY IS THIS IMPORTANT
Because I want to guarantee order without using an ORDER BY clause on a clustered table.
On 15.0 and upwards, ALWAYS specify an ORDER BY if you want a specific order, as the structure of the data and index varies between allpages-locked and data-only-locked (DOL) tables.
The optimizer may, for example, choose to do parts of the query retrieval in parallel under the covers, depending on your parallelism settings, which is why the ORDER BY is important. A bare select * doesn't request any specific order.
Just add the ORDER BY and you'll be fine: the select * is going to table scan anyway because you're asking for the whole table, so there is no need for index hints.
THE EXPLANATION
Clustered indexes are logically ordered but not necessarily physically ordered.
This means that a table scan, if it is done in physical order, can return rows in a different order than a clustered index scan, which follows the logical order.
This logical-to-physical mapping is controlled by the OAM (Object Allocation Map).
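A toy Python model of the difference, assuming pages are numbered in the order they were allocated and each page links to the next page in index order. The page numbers and contents are invented, and this is not how ASE actually stores pages, just the shape of the idea.

```python
# Each entry: physical page number -> (rows on the page, next page in index order).
# Page 3 was allocated last but logically sits between pages 1 and 2.
pages = {
    1: ([10, 20], 3),
    2: ([50, 60], None),
    3: ([30, 40], 2),
}


def physical_scan(pages):
    """Read pages in allocation (page number) order, as a table scan might."""
    out = []
    for page_no in sorted(pages):
        out.extend(pages[page_no][0])
    return out


def logical_scan(pages, first_page):
    """Follow the next-page links, as an ordered index scan would."""
    out, page_no = [], first_page
    while page_no is not None:
        rows, page_no = pages[page_no]
        out.extend(rows)
    return out


print(physical_scan(pages))
print(logical_scan(pages, first_page=1))
```

Both scans return the same rows; only the order differs, which matches the "kind of sorted, with glitches" result described in the question.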

SQL Server Clustered Index and Non-Clustered Index for same column

If one column in a table has both a clustered and a non-clustered index defined on it for any reason, is there any disadvantage in that? Just curious.
If both indexes are on the same column or columns (and in the same order), then yes: the non-clustered one is redundant. Both provide the same optimization for select queries that fetch individual records, and the clustered index additionally improves select queries that return multiple records filtered on a range of values for that column.
And by having both in place you incur an additional write (insert/update/delete) performance hit, because two indexes have to be updated instead of only one.

What is a Bookmark Lookup in SQL Server?

I'm in the process of trying to optimize a query that looks up historical data. I'm using the query analyzer to look at the execution plan and have found that the majority of my query cost is on something called a "Bookmark Lookup". I've never seen this node in an execution plan before and don't know what it means.
Is this a good thing or a bad thing in a query?
A bookmark lookup is the process of finding the actual data in the SQL table, based on an entry found in a non-clustered index.
When you search for a value in a non-clustered index, and your query needs more fields than are part of the index leaf node (all the index fields, plus any possible INCLUDE columns), then SQL Server needs to go retrieve the actual data page(s) - that's what's called a bookmark lookup.
In some cases, that's really the only way to go. But if your query requires just one more field (not a whole bunch of 'em), it might be a good idea to INCLUDE that field in the non-clustered index. In that case, the leaf-level node of the non-clustered index would contain all fields needed to satisfy your query (a "covering" index), and thus a bookmark lookup wouldn't be necessary anymore.
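A small Python model of why a covering index removes the lookups. The table, the column names and both "indexes" are made up, and plain sorted lists stand in for index leaf pages.

```python
# Hypothetical table keyed by row id: (col1, col2, col3).
table = {1: ("a", 5, "x"), 2: ("b", 7, "y"), 3: ("c", 9, "z")}

# Index on col2 only: leaf entries hold the key and a row pointer.
plain_index = sorted((row[1], rid) for rid, row in table.items())
# Index on col2 that also carries col1 at the leaf level (like INCLUDE (col1)).
covering_index = sorted((row[1], rid, row[0]) for rid, row in table.items())


def select_col1_plain(lo, hi):
    """Return (col1 values, number of base-table lookups) using the plain index."""
    lookups, result = 0, []
    for key, rid in plain_index:
        if lo <= key <= hi:
            result.append(table[rid][0])  # extra trip to the base table
            lookups += 1
    return result, lookups


def select_col1_covering(lo, hi):
    """Same query, but everything needed is already in the index entry."""
    return [col1 for key, _rid, col1 in covering_index if lo <= key <= hi], 0
```

Both functions return the same col1 values; only the plain index pays one base-table lookup per matching row.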
Marc
It's a NESTED LOOP which joins a non-clustered index with the table itself on a row pointer.
It happens for queries like this, if you have an index on col2:
SELECT col1
FROM mytable
WHERE col2 BETWEEN 1 AND 10
The index on col2 contains pointers to the indexed rows.
So, in order to retrieve the value of col1, the engine needs to scan the index on col2 for the key values from 1 to 10, and for each index leaf, refer to the table itself using the pointer contained in the leaf, to find out the value of col1.
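Roughly, in Python terms (a sketch only: the heap, the pointer values and the names are invented, and a sorted list plus bisect stands in for the index B-tree):

```python
from bisect import bisect_left, bisect_right

# Hypothetical heap: row pointer -> (col1, col2).
heap = {100: ("red", 12), 101: ("green", 3), 102: ("blue", 8), 103: ("grey", 1)}

# Non-clustered index on col2: sorted (col2, row pointer) pairs.
index_on_col2 = sorted((col2, ptr) for ptr, (_col1, col2) in heap.items())
index_keys = [key for key, _ptr in index_on_col2]


def select_col1_between(lo, hi):
    """Outer loop: index range scan. Inner step: one table lookup per index entry."""
    start = bisect_left(index_keys, lo)
    stop = bisect_right(index_keys, hi)
    return [heap[ptr][0] for _key, ptr in index_on_col2[start:stop]]


print(select_col1_between(1, 10))
```

The rows come back in index (col2) order, and each one costs a separate visit to the table through its pointer.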
This article points out that Bookmark Lookup is SQL Server 2000's term; in SQL Server 2005 and above it is shown as a nested loops join between the index and the table.
From MSDN regarding Bookmark Lookups:
The Bookmark Lookup operator uses a bookmark (row ID or clustering key) to look up the corresponding row in the table or clustered index. The Argument column contains the bookmark label used to look up the row in the table or clustered index. The Argument column also contains the name of the table or clustered index in which the row is looked up. If the WITH PREFETCH clause appears in the Argument column, the query processor has determined that it is optimal to use asynchronous prefetching (read-ahead) when looking up bookmarks in the table or clustered index.

Should I get rid of clustered indexes on GUID columns?

I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.
You almost certainly want to establish a clustered index on every table in your database.
If a table does not have a clustered index, it is what is referred to as a "heap", and most types of common queries perform worse on a heap than on a table with a clustered index.
Which fields the clustered index should be established on depends on the table itself and on the expected usage patterns of queries against the table. In almost every case you probably want the clustered index to be on a column or combination of columns that is unique (i.e., an alternate key), because if it isn't, SQL Server will add a unique value to the end of whatever fields you select anyway. If your table has a column or columns that will be frequently used by queries to select or filter multiple records, those columns are candidates for the clustered index. Examples: a sales transactions table where the application frequently requests transactions by product ID; or, even better, an invoice details table, where in almost every case you retrieve all the detail records for a specific invoice; or an invoice table where you often retrieve all the invoices for a particular customer. This is true whether you select large numbers of records by a single value or by a range of values.
The order of the columns in the clustered index is critical. The first column defined in the index should be the column that will be selected or filtered on first in expected queries.
The reason for all this is based on understanding the internal structure of a database index. These indexes are called balanced-tree (B-tree) indexes. They are kind of like a binary tree, except that each node in the tree can have an arbitrary number of entries (and child nodes) instead of just two. What makes a clustered index different is that the leaf nodes of a clustered index are the actual data pages of the table itself, whereas the leaf nodes of a non-clustered index just "point" to the table's data pages.
When a table has a clustered index, therefore, the table's data pages are the leaf level of that index, and each one has a pointer to the previous page and the next page in the index order (they form a doubly-linked list).
So if your query requests a range of rows that is in the same order as the clustered index... the processor only has to traverse the index once (or maybe twice), to find the start page of the data, and then follow the linked list pointers to get to the next page and the next page, until it has read all the data pages it needs.
For a non-clustered index, it has to traverse the index once for every row it retrieves...
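As a rough illustration of the clustered range scan, here is a Python sketch with a one-level "root" over four leaf pages. The page contents are made up, and stepping to the next list element stands in for following the next-page pointer.

```python
from bisect import bisect_right

# Leaf pages of a hypothetical clustered index, in key order; each holds whole rows
# (represented here by their keys only).
leaf_pages = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# One-level "root": the first key on each leaf page.
root = [page[0] for page in leaf_pages]


def range_scan(lo, hi):
    """Descend once to the start page, then walk neighbouring pages until past hi."""
    page_no = max(bisect_right(root, lo) - 1, 0)
    rows, pages_read = [], 0
    while page_no < len(leaf_pages) and leaf_pages[page_no][0] <= hi:
        pages_read += 1
        rows.extend(k for k in leaf_pages[page_no] if lo <= k <= hi)
        page_no += 1  # stands in for following the next-page pointer
    return rows, pages_read


print(range_scan(5, 8))
```

The index is traversed once to find the starting page; after that the scan only touches the leaf pages that actually contain rows in the range.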
NOTE: EDIT
To address the sequential issue for GUID key columns, be aware that SQL Server 2005 has NEWSEQUENTIALID(), which does in fact generate GUIDs the "old" sequential way.
Or you can investigate Jimmy Nilsson's COMB GUID algorithm, which is implemented in client-side code:
COMB Guids
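The general idea behind a COMB, as I understand it, is to overwrite part of a random GUID with the current time so that later values sort after earlier ones. The Python sketch below is not the original algorithm. It assumes that the last six bytes are the ones SQL Server compares first when ordering uniqueidentifier values, and it uses a millisecond Unix timestamp, which is my own choice of time encoding. The function name comb_guid is made up.

```python
import time
import uuid


def comb_guid(timestamp_ms=None):
    """Random GUID whose last 6 bytes are a millisecond timestamp (big-endian)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = uuid.uuid4().bytes[:10]                             # keep 10 random bytes
    time_part = (timestamp_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")    # 48-bit time suffix
    return uuid.UUID(bytes=random_part + time_part)


print(comb_guid())
```

Values generated later get a larger six-byte suffix, so under the ordering assumption above new rows tend to land at the end of the index instead of in the middle.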
The problem with clustered indexes on a GUID field is that the GUIDs are random, so a new record usually has to be inserted into the middle of the table. When the target page is full, it has to be split and part of its data moved to a new page to make room.
However, with integer-based clustered indexes, the integers are normally sequential (like with an IDENTITY spec), so they just get added to the end and no data needs to be moved around.
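A toy simulation of that difference in Python, assuming a tiny page capacity and a simple "split the full page in half" rule. The real engine is more involved; the capacity and the split policy here are assumptions made for illustration.

```python
import random
from bisect import bisect_right, insort

PAGE_CAPACITY = 4  # rows per page; tiny on purpose, real pages hold far more


def count_page_splits(keys):
    """Insert keys into ordered pages; split a full page in half when needed."""
    pages, splits = [[]], 0
    for key in keys:
        firsts = [page[0] for page in pages[1:]]  # first key of every page after the first
        idx = bisect_right(firsts, key)           # page whose key range covers this key
        page = pages[idx]
        if len(page) == PAGE_CAPACITY:
            if idx == len(pages) - 1 and key > page[-1]:
                pages.append([key])               # append at the end: nothing moves
                continue
            half = PAGE_CAPACITY // 2
            pages.insert(idx + 1, page[half:])    # move the upper half to a new page
            del page[half:]
            splits += 1
            if key >= pages[idx + 1][0]:
                page = pages[idx + 1]
        insort(page, key)
    return splits


sequential = list(range(200))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)
print(count_page_splits(sequential), count_page_splits(shuffled))
```

Sequential keys only ever append new pages, while the shuffled keys (standing in for random GUIDs) keep landing on full pages and forcing splits.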
On the other hand, clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. If you need to be able to SELECT records quickly, then use a clustered index... the INSERT speed will suffer, but the SELECT speed will be improved.
While clustering on a GUID is normally a bad idea, be aware that GUIDs can under some circumstances cause fragmentation even in non-clustered indexes.
Note that if you're using SQL Server 2005, the newsequentialid() function produces sequential GUIDs. This helps to prevent the fragmentation problem.
I suggest using a SQL query like the following to measure fragmentation before making any decisions (excuse the non-ANSI syntax):
-- Run in the context of MyDatabase: OBJECT_NAME and sys.indexes resolve
-- against the current database.
SELECT OBJECT_NAME(ips.object_id)                    AS [Object Name],
       si.name                                       AS [Index Name],
       ROUND(ips.avg_fragmentation_in_percent, 2)    AS [Fragmentation],
       ips.page_count                                AS [Pages],
       ROUND(ips.avg_page_space_used_in_percent, 2)  AS [Page Density]
FROM sys.dm_db_index_physical_stats(
         DB_ID('MyDatabase'), NULL, NULL, NULL, 'DETAILED') AS ips
INNER JOIN sys.indexes AS si
        ON si.object_id = ips.object_id
       AND si.index_id = ips.index_id
WHERE ips.index_level = 0;  -- leaf level only
If you are using NewId(), you could switch to NewSequentialId(). That should help insert performance.
Yes, there's no point in having a clustered index on a random value.
You probably do want clustered indexes SOMEWHERE in your database. For example, if you have an "Author" table and a "Book" table with a foreign key to "Author", and if you have a query in your application that says, "select ... from Book where AuthorId = ..", then you would be reading a set of books. It will be faster if those books are physically next to each other on the disk, so that the disk head doesn't have to bounce around from sector to sector gathering all the books of that author.
So, you need to think about your application, the ways in which it queries the database.
Make the changes.
And then test, because you never know...
As most have mentioned, avoid using a random identifier in a clustered index; you will not gain the benefits of clustering. Actually, you will experience an increased delay. Getting rid of all of them is solid advice. Also keep in mind that newsequentialid() can be extremely problematic in a multi-master replication scenario. If databases A and B both invoke newsequentialid() prior to replication, you will have a conflict.
Yes you should remove the clustered index on GUID primary keys for the reasons Galwegian states above. We have done this on our applications.
It depends if you're doing a lot of inserts, or if you need very quick lookup by PK.

Resources