PK Index fragmentation on IDENTITY columns with VARBINARY(MAX) column in table - sql-server

I have some tables (Table A and Table B) with a BIGINT IDENTITY column as the primary key.
In each of those tables I have 2 VARBINARY(MAX) columns. Updates and deletes are very rare.
They have almost the same row count; Table B has a bit fewer rows but significantly more data in the VARBINARY(MAX) columns.
I was surprised to see that the storage used by the PK in Table B was much higher than the storage used by the PK in Table A.
Doing some reading on the subject (correct me if I am wrong) clarified that it has something to do with the maximum row size of around 8K: there is some paging going on, with a byte reference which is then included in the index. Hence the larger storage used by the PK in Table B. It is around 30 percent of the total size of the DB.
I was under the assumption that only the BIGINT was part of the index.
My question is whether there is a workaround for this. Are there any designs, techniques or hacks that can prevent it?
Regards
Vilma

By default, a PK is a CLUSTERED index: the data is stored with the key. You can have only one clustered index per table, because the data can only be stored in one place. So any clustered index (such as a default PK) will take up more space than a non-clustered index.
If there is more varbinary data in Table B, then I would expect its PK to take up more space.
However, since this varbinary is (MAX), the initial thought is that only a data pointer should be stored with the key. However, if the row is small enough (i.e. < 8000 bytes) I imagine that SQL Server optimises the store/retrieve by keeping the data with the key, thus increasing the size of the index. I do not know for certain that this happens, but I was unable to find anything to say it doesn't; as an optimisation it seems reasonable.
Take that for what it's worth!
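If that in-row optimisation is indeed the cause, one possible workaround is to force the (MAX) values off-row so that only a small pointer stays in the clustered index rows. A sketch, assuming the table is dbo.TableB and the LOB columns are BlobCol1 and BlobCol2:

-- Store varbinary(max) values off-row, keeping only a pointer in the row.
EXEC sp_tableoption 'dbo.TableB', 'large value types out of row', 1;

-- The option only affects newly written values; existing rows must be
-- rewritten for it to take effect, e.g. with a no-op update of the LOB columns:
UPDATE dbo.TableB SET BlobCol1 = BlobCol1, BlobCol2 = BlobCol2;

The trade-off is an extra page read whenever the blob values themselves are fetched.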

Related

SQL - define keys to table

Are there any considerations when defining keys for a table that already has a lot of records, where most of the operations performed on it are inserts?
Key definition ultimately comes down to how you can uniquely and efficiently identify any specific row in a table. If a business key value fulfills that requirement, then it is a suitable candidate. An ideal key is also skinny. A GUID is horrible for this (IMHO) because it is far larger than it needs to be.
If insert performance is the most important priority and a suitable business key is not available, you can use an integer based identity key. If you expect more than 2.1 billion records within a few years, use bigint (9 quintillion records) instead.
Keep in mind that every nonclustered index you create on the table will always include the clustering key (by default, the PK). Having a skinny PK can make your indexes more efficient, using less storage, memory and CPU.
Insert speed is affected by the clustered index sort order as well as the number and sort order of all non-clustered indexes on the table. Column-store indexes are not sorted and have minimal overhead on inserts.
A PK that stores a business ID number is heavier than an auto-incrementing number, so when you define the key, keep in mind that it is better to create a separate auto-incrementing column to serve as the PK.
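A minimal sketch of that advice (all names here are illustrative): a skinny, auto-incrementing surrogate PK, with the business key kept unique by a separate constraint:

CREATE TABLE dbo.Orders (
    OrderId     bigint IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,  -- skinny, ever-increasing key
    OrderNumber varchar(20) NOT NULL
        CONSTRAINT UQ_Orders_OrderNumber UNIQUE,     -- business key stays unique
    OrderDate   datetime2 NOT NULL
);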

How much of a performance impact does the data type have on the PK?

I have worked on a number of systems which use different types as their PK. Common types are:
Int32
Int16
varchar (of varying sizes, usually about 16)
UniqueIdentifier
I'm aware that the more memory used by the field, the larger the indices become and so the slower the searches (so clearly an nvarchar(1024) would be very bad!)
How dramatic (if any) are the performance changes when using different data types for PK columns?
Choosing a combination of several varchar columns as a primary key makes it large. SQL Server automatically adds a clustered index on a primary key if you don't already have one, and that index also becomes big (much bigger on varchar than on an integer column), which results in an increased number of index pages used to store the index keys. This increases the number of reads required to read the index and degrades overall index performance. Integers require fewer bytes to store, hence the primary key index structure will be smaller.
Also, primary key column(s) of type varchar may make JOIN statements slower compared to integer joins.
uniqueidentifier is useful if you need GUID keys - guaranteed to be unique across all tables in your schema, databases and servers. But they are 4 times larger than the traditional 4-byte integer value, and this may have performance and storage downsides.
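To make the size difference concrete, here is a sketch (table and column names are made up). If GUID keys are unavoidable, NEWSEQUENTIALID() at least avoids the random-insert fragmentation that NEWID() causes:

CREATE TABLE dbo.CustomersInt (
    CustomerId int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- 4 bytes per key entry
    Name nvarchar(100) NOT NULL
);

CREATE TABLE dbo.CustomersGuid (
    CustomerId uniqueidentifier NOT NULL
        CONSTRAINT DF_CustomersGuid_Id DEFAULT NEWSEQUENTIALID()
        PRIMARY KEY,                                    -- 16 bytes per key entry
    Name nvarchar(100) NOT NULL
);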
You can also see more info on the cost of GUID vs int here
And here: Selecting the right datatype to improve database performance
Hope this helps

SQL server - worth indexing large string keys?

I have a table that has a large string key (varchar(1024)) that I was thinking of indexing on SQL Server (I want to be able to search over it quickly, but inserts are also important). In SQL Server 2008 I don't get a warning for this, but under SQL Server 2005 it tells me that it exceeds 900 bytes and that inserts/updates where the column exceeds this size will be dropped (or something in that area).
What are my alternatives if I wanted to index this large column? I don't know if it would be worth it even if I could.
An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
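A back-of-the-envelope illustration: an index page has roughly 8,096 bytes of usable space, so keys near 900 bytes give a fan-out of only about 8 entries per page. Indexing a million rows then takes a B-Tree roughly 7 levels deep (8^7 ≈ 2 million), whereas an 8-byte key with a fan-out in the hundreds covers the same million rows in about 3 levels.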
It depends on how you plan to query the values. An index is useful in several cases:
when a value is probed. This is the most typical use: an exact value is searched in the table. Typical examples are WHERE column='ABC' or a join condition ON a.column = b.someothercolumn.
when a range is scanned. This is also fairly typical, when a range of values is searched in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are other, less obvious examples, like a partial match: WHERE column LIKE 'ABC%'.
an ordering requirement. This use is less known, but indexes can help a query that has an explicit ORDER BY column requirement avoid a stop-and-go sort, and can also help certain hidden sort requirements, like a ROW_NUMBER() OVER (ORDER BY column).
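In query form, the three cases look roughly like this (dbo.T and col are assumed names):

SELECT * FROM dbo.T WHERE col = 'ABC';                         -- probe
SELECT * FROM dbo.T WHERE col LIKE 'ABC%';                     -- range scan
SELECT col, ROW_NUMBER() OVER (ORDER BY col) AS rn FROM dbo.T; -- ordering requirement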
So, what do you need the index for? What kind of queries would use it?
For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.
For probes you can, potentially, use hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index on that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column='ABC'. Careful consideration would have to be given to weighing the advantage of a narrow index (32 bit checksum) vs. the disadvantages of collision double-check and lack of range scan and order capabilities.
after the comment
I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along the lines:
create table values_dictionary (
    id int not null identity(1,1),
    value varchar(8000) not null,
    value_hash as checksum(value) persisted,  -- persisted computed hash column
    constraint pk_values_dictionary_id
        primary key nonclustered (id));

create unique clustered index cdx_values_dictionary_checksum
    on values_dictionary (value_hash, id);
go

create procedure usp_get_or_create_value_id (
    @value varchar(8000),
    @id int output)
as
begin
    declare @hash int;
    set @hash = CHECKSUM(@value);
    set @id = NULL;
    -- probe on the narrow hash, then verify the full value to rule out collisions
    select @id = id
        from values_dictionary
        where value_hash = @hash
          and value = @value;
    if @id is null
    begin
        insert into values_dictionary (value)
        values (@value);
        set @id = scope_identity();
    end
end
In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value, and bypasses the 900-byte limitation. The primary key on id is non-clustered, which means that looking up the value from an id incurs the overhead of one extra probe in the clustered index.
Not sure if this answers your problem; you obviously know more about your actual scenario than I do. Also, the code does not handle error conditions and, under concurrency, can actually insert duplicate @value entries, which may or may not be acceptable.
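For completeness, a hypothetical call would look like this:

declare @id int;
exec usp_get_or_create_value_id @value = 'http://example.com/some/url', @id = @id output;
select @id as value_id;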
General Index Design Guidelines
When you design an index, consider the following column guidelines:
Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.
Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.
Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all people in the city are named Smith or Jones.
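As a sketch of the nonkey-column technique the guidelines mention (names assumed), a (MAX) column can ride along in a nonclustered index as an included column even though it cannot be a key column:

CREATE NONCLUSTERED INDEX IX_Docs_Title
    ON dbo.Docs (Title)
    INCLUDE (Body);  -- Body may be varbinary(max): allowed as an included, nonkey column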

Difference between clustered and nonclustered index [duplicate]

I need to add proper indexes to my tables and need some help.
I'm confused and need to clarify a few points:
Should I use indexes for non-int columns? Why/why not?
I've read a lot about clustered and non-clustered indexes, yet I still can't decide when to use one over the other. A good example would help me and a lot of other developers.
I know that I shouldn't use indexes for columns or tables that are often updated. What else should I be careful about, and how can I know that it is all good before going to the test phase?
A clustered index alters the way that the rows are stored. When you create a clustered index on a column (or a number of columns), SQL server sorts the table’s rows by that column(s). It is like a dictionary, where all words are sorted in alphabetical order in the entire book.
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference.
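As a small illustration of the two kinds (a sketch; the names are assumptions):

CREATE TABLE dbo.Employee (
    EmployeeId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED,  -- rows are physically ordered by this key
    LastName nvarchar(50) NOT NULL
);

-- A separate structure: sorted LastName values plus pointers back to the rows.
CREATE NONCLUSTERED INDEX IX_Employee_LastName
    ON dbo.Employee (LastName);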
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way!
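To make that concrete, a sketch with assumed names: the PK is declared NONCLUSTERED and the clustering key goes on a different, ever-increasing column:

CREATE TABLE dbo.Documents (
    DocumentGuid uniqueidentifier NOT NULL
        CONSTRAINT PK_Documents PRIMARY KEY NONCLUSTERED,  -- logical identity
    DocumentId int IDENTITY(1,1) NOT NULL,                 -- small, ever-increasing clustering key
    Title nvarchar(200) NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX CX_Documents_DocumentId
    ON dbo.Documents (DocumentId);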
One rule of thumb I would apply is this: any "regular" table (one that you use to store data in, a lookup table, etc.) should have a clustering key. There's really no reason not to have a clustering key. Actually, contrary to common belief, having a clustering key actually speeds up all the common operations - even inserts and deletes (since the table organization is different and usually better than with a heap - a table without a clustering key).
Kimberly Tripp, the Queen of Indexing, has a great many excellent articles on the topic of why to have a clustering key and what kind of columns best serve as your clustering key. Since you only get one per table, it's of utmost importance to pick the right clustering key - and not just any clustering key.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc
You should be using indexes to help SQL server performance. Usually that implies that columns that are used to find rows in a table are indexed.
Clustered indexes make SQL Server order the rows on disk according to the index order. This implies that if you access data in the order of a clustered index, then the data will be present on disk in the correct order. However, if the column(s) of a clustered index are frequently changed, then the row(s) will move around on disk, causing overhead - which generally is not a good idea.
Having many indexes is not good either. They cost resources to maintain. So start out with the obvious ones, and then profile to see which ones you are missing and would benefit from. You do not need them from the start; they can be added later on.
Most column datatypes can be used when indexing, but it is better to have small columns indexed than large ones. Also, it is common to create indexes on groups of columns (e.g. country + city + street) - as sketched below.
Also, you will not notice performance issues until you have quite a bit of data in your tables. Another thing to think about is that SQL Server needs statistics to do its query optimizations the right way, so make sure that you generate those.
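A composite index over those example columns might look like this (the table name is assumed):

CREATE NONCLUSTERED INDEX IX_Address_Country_City_Street
    ON dbo.Address (Country, City, Street);

Column order matters: this index can serve filters on Country, on Country + City, or on all three, but not efficiently on Street alone.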
A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let's say that we have a non-clustered index on the EmployeeID column. A non-clustered index will store both the value of EmployeeID AND a pointer to the row in the Employee table where that value is actually stored. A clustered index, on the other hand, will actually store the row data for a particular EmployeeID - so if you are running a query that looks for an EmployeeID of 15, the data from other columns in the table, like EmployeeName, EmployeeAddress, etc., will all actually be stored in the leaf node of the clustered index itself.
This means that with a non-clustered index extra work is required to follow that pointer to the row in the table to retrieve any other desired values, as opposed to a clustered index which can just access the row directly since it is being stored in the same order as the clustered index itself. So, reading from a clustered index is generally faster than reading from a non-clustered index.
In general, use an index on a column that's going to be used (a lot) to search the table, such as a primary key (which by default has a clustered index). For example, if you have the query (in pseudocode)
SELECT * FROM FOO WHERE FOO.BAR = 2
You might want to put an index on FOO.BAR. A clustered index should be used on a column that will be used for sorting. A clustered index is used to sort the rows on disk, so you can only have one per table. For example if you have the query
SELECT * FROM FOO ORDER BY FOO.BAR ASCENDING
You might want to consider a clustered index on FOO.BAR.
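In index terms, those two queries might call for something like the following sketches:

-- supports SELECT * FROM FOO WHERE FOO.BAR = 2
CREATE NONCLUSTERED INDEX IX_FOO_BAR ON dbo.FOO (BAR);

-- alternatively, if BAR drives the dominant ordering, make it the clustering key
-- (remember: only one clustered index per table)
CREATE CLUSTERED INDEX CX_FOO_BAR ON dbo.FOO (BAR);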
Probably the most important consideration is how much time your queries are taking. If a query doesn't take much time or isn't used very often, it may not be worth adding indexes. As always, profile first, then optimize. SQL Server Management Studio can give you suggestions on where to optimize, and MSDN has some information that you might find useful.
Clustered index:
- faster to read than a non-clustered index, as the data is physically stored in index order
- we can create only one per table
Non-clustered index:
- quicker for insert and update operations than a clustered index
- we can create any number of them per table

Large scale ETL string lookups performance issues

I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:
id bigint identity(1,1)
raw_url varchar(2000) not null
md5hash char(32) not null
job_control_number int not null
Clustered unique index on the id and non clustered unique index on md5hash
SQL Server 2008 Enterprise
Page level compression is turned on
We have to store the raw URLs from our web-server logs as a dimension. Since the raw string is > 900 characters, we cannot put a unique index on that column. We use an MD5 hash function to create a unique 32-character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.
The problem is poor performance. The md5hash is of course random by nature, so the index fragmentation drives towards 50%, which leads to inefficient IO.
Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.
I would break up the table into physical files, with the older non-changing data in a read-only file group. Make sure the non-clustered index is also in the filegroup.
Edit (from comment): And while I'm thinking about it, if you turn off page level compression, that'll improve I/O as well.
I would argue that it should be a degenerate dimension in the fact table.
And figure some way to do partitioning on the data. Maybe take the first xxx characters and store them as a separate field, and partition by that.
Then when you're doing lookups, you pass both the short and the long column, so the search hits a single partition first - as sketched below.
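A sketch of that prefix idea (the table name, prefix length, and column names are assumptions):

-- Persist a short prefix of the URL next to the full value.
ALTER TABLE dbo.url_dim ADD raw_url_prefix AS LEFT(raw_url, 10) PERSISTED;

-- Lookups then pass both the short and the long value:
DECLARE @raw_url varchar(2000) = 'http://example.com/page';
DECLARE @md5hash char(32) = CONVERT(char(32), HASHBYTES('MD5', @raw_url), 2);

SELECT id
FROM dbo.url_dim
WHERE raw_url_prefix = LEFT(@raw_url, 10)
  AND md5hash = @md5hash
  AND raw_url = @raw_url;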
