Recreate index on column store indexed table with 35 billion rows - sql-server

I have a big table whose index I need to rebuild. The table has a Clustered Columnstore Index (CCI), and we realized we need to sort the data for a specific use case.
Users run date range and equality queries, but because the data is not sorted the way it is queried, the queries are not optimal. The SQL Advisory Team recommended organizing the data into the right rowgroups so that queries can benefit from rowgroup elimination.
Table Description:
Partition by Timestamp1, monthly PF
Total Rows: 31 billion
Est row size: 60 bytes
Est table size: 600 GB
Table Definition:
CREATE TABLE [dbo].[Table1](
[PkId] [int] NOT NULL,
[FKId1] [smallint] NOT NULL,
[FKId2] [int] NOT NULL,
[FKId3] [int] NOT NULL,
[FKId4] [int] NOT NULL,
[Timestamp1] [datetime2](0) NOT NULL,
[Measurement1] [real] NULL,
[Measurement2] [real] NULL,
[Measurement3] [real] NULL,
[Measurement4] [real] NULL,
[Measurement5] [real] NULL,
[Timestamp2] [datetime2](3) NULL,
[TimeZoneOffset] [tinyint] NULL
)
CREATE CLUSTERED COLUMNSTORE INDEX [Table1_ColumnStoreIndex] ON [dbo].[Table1] WITH (DROP_EXISTING = OFF)
GO
Environment:
SQL Server 2014 Enterprise Ed.
8 Cores, 32 GB RAM
VMWare High Performance Platform
My strategy is:
Drop the existing CCI
Create ordinary Clustered Row Index with the right columns, this will sort the data
Recreate the CCI with DROP_EXISTING = ON. This will convert the existing CRI into a CCI.
My questions are:
Does it make sense to rebuild the index, or should I just reload the data? Reloading may take a month to complete, whereas rebuilding the index may take just as long.
If I drop the existing CCI, will the table expand because it is no longer compressed?

31 billion rows is roughly 30,000 full rowgroups (a rowgroup holds at most 1,048,576 rows). A rowgroup is just another form of horizontal partitioning, so it really matters when and how you load your data. SQL Server 2014 supports only offline builds of columnstore indexes.
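As a rough check on the rowgroup count, the arithmetic and a per-partition inventory can be sketched as below. The catalog view sys.column_store_row_groups is available from SQL Server 2014. The table name is taken from the question, and the query is only an illustration.

```sql
-- Back-of-envelope: full rowgroups needed for 31 billion rows,
-- given at most 1,048,576 rows per compressed rowgroup.
SELECT CEILING(31000000000.0 / 1048576) AS full_rowgroups;  -- about 29,564

-- Rowgroups per partition for the existing columnstore index.
SELECT partition_number,
       COUNT(*)          AS rowgroups,
       SUM(total_rows)   AS total_rows,
       SUM(deleted_rows) AS deleted_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID(N'dbo.Table1')
GROUP BY partition_number
ORDER BY partition_number;
```

Running the second query before and after a test load is one way to see how many rowgroups a partition ends up with.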
There are a few pros and cons to consider when choosing between creating the index and reloading:
Create index is a single operation, so if it fails at any point you lose your progress. I would not recommend it at your data size.
An index build creates primary dictionaries, which is beneficial for low-cardinality, dictionary-encoded columns.
Bulk load won't create primary dictionaries, but you can reload individual batches if they fail.
Both index build and bulk load run in parallel if you give them enough resources, which means the ordering from the base clustered index won't be perfectly preserved. This is just something to be aware of; at your scale of data a few overlapping rowgroups won't matter.
If your data undergoes updates/deletes and you reorganize (from SQL Server 2019 the tuple mover also does this), your ordering might degrade over time.
I would create a clustered index ordered on the date_range column and partition on that column so that you have anywhere between 50 and 200 rowgroups per partition (do some experiments). Then you can create a partition-aligned Clustered Columnstore Index and switch in one partition at a time. The partition switch will trigger an index build, so you get the benefit of primary dictionaries, and if you end up with updates/deletes on a partition you can restore the index quality by rebuilding that partition rather than the whole table. If you decide to use reorganize you still maintain some level of ordering, because rowgroups are only merged within the same partition.
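The per-partition approach could look roughly like the sketch below for a single month. Everything not in the question is an assumption made for illustration: the staging table name Table1_Stage, the partition function name PF_Monthly, the January 2015 boundaries, the [PRIMARY] filegroup, and the sort key (FKId1, Timestamp1). MAXDOP = 1 is used here to keep the sorted order during the columnstore build, at the cost of a slower build.

```sql
-- Staging table with the same columns as dbo.Table1, on the same filegroup
-- as the target partition (assumed [PRIMARY]). The CHECK constraint must
-- fall inside the target partition's range for the switch to be allowed.
CREATE TABLE dbo.Table1_Stage (
    PkId           int          NOT NULL,
    FKId1          smallint     NOT NULL,
    FKId2          int          NOT NULL,
    FKId3          int          NOT NULL,
    FKId4          int          NOT NULL,
    Timestamp1     datetime2(0) NOT NULL
        CONSTRAINT CK_Table1_Stage_Month
        CHECK (Timestamp1 >= '2015-01-01' AND Timestamp1 < '2015-02-01'),
    Measurement1   real         NULL,
    Measurement2   real         NULL,
    Measurement3   real         NULL,
    Measurement4   real         NULL,
    Measurement5   real         NULL,
    Timestamp2     datetime2(3) NULL,
    TimeZoneOffset tinyint      NULL
) ON [PRIMARY];
GO
-- (Load one month of data into dbo.Table1_Stage here.)

-- Sort the data with a rowstore clustered index on the hypothetical key.
CREATE CLUSTERED INDEX CIX_Table1_Stage
    ON dbo.Table1_Stage (FKId1, Timestamp1);
GO
-- Convert it in place to a clustered columnstore index.
CREATE CLUSTERED COLUMNSTORE INDEX CIX_Table1_Stage
    ON dbo.Table1_Stage
    WITH (DROP_EXISTING = ON, MAXDOP = 1);
GO
-- Switch into the matching (empty) partition of the target table.
ALTER TABLE dbo.Table1_Stage
    SWITCH TO dbo.Table1 PARTITION $PARTITION.PF_Monthly('2015-01-01');
GO
```

If one month fails, only that month's staging table has to be rebuilt, which is the main advantage over a single index build on the whole table.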

Related

Index and primary key in large table that doesn't have an Id column

I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance mostly on selecting data, but also in inserts.
IndicatorValue
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT null,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch inserted between 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I read that it's always good to have PK on every table.
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
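One way to write down the layout the asker is considering is sketched below. It assumes that (IndicatorId, Interval, UnixTime) identifies a row uniquely, which the question suggests but does not confirm, and the constraint name is hypothetical. With this key order, both example queries can seek on IndicatorId and Interval and read the rows already ordered by UnixTime.

```sql
-- Sketch only: clustered primary key covering the filter columns first,
-- then the ORDER BY column. Assumes the three columns are unique together.
CREATE TABLE dbo.IndicatorValue (
    IndicatorId uniqueidentifier NOT NULL,  -- foreign key in the original
    UnixTime    bigint           NOT NULL,
    [Value]     decimal(15,4)    NOT NULL,
    [Interval]  int              NOT NULL,
    CONSTRAINT PK_IndicatorValue
        PRIMARY KEY CLUSTERED (IndicatorId, [Interval], UnixTime)
);
```

Whether batch inserts stay fast with a GUID as the leading key column would need to be measured on real data.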

XML Index Slows Down Queries

I have a simple table with the following structure, with ~10 million rows:
CREATE TABLE [dbo].[DataPoints](
[ID] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[ModuleID] [uniqueidentifier] NOT NULL,
[DateAndTime] [datetime] NOT NULL,
[Username] [nvarchar](100) NULL,
[Payload] [xml] NULL
)
Payload is similar to this for all rows:
<payload>
<total>1000000</total>
<free>300000</free>
</payload>
The following two queries take around 11 seconds each to execute on my dev machine before creating an index on Payload column:
SELECT AVG(Payload.value('(/payload/total)[1]','bigint')) FROM DataPoints
SELECT COUNT(*) FROM DataPoints
WHERE Payload.value('(/payload/total)[1]','bigint') = 1000000
The problem is when I create an XML index on Payload column, both queries take much longer to complete! I want to know:
1) Why is this happening? Isn't an XML index supposed to speed up queries, or at least a query where a value from the XML column is used in the WHERE clause?
2) What would be the proper scenario for using XML indexes, if they are not suitable for my case?
This is on SQL Server 2014.
An ordinary XML index indexes everything in the XML payload.
Selective XML Indexes (SXI)
The main limitation with ordinary XML indexes is that they index the entire XML document. This leads to several significant drawbacks, such as decreased query performance and increased index maintenance cost, mostly related to the storage costs of the index.
You will want to create a Selective XML index for better performance.
The other option is to create Secondary Indexes
XML Indexes (SQL Server)
To enhance search performance, you can create secondary XML indexes. A primary XML index must first exist before you can create secondary indexes.
So the purpose of the primary XML index here is to let you create the secondary indexes.
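The two options above can be sketched as follows for the table in the question. The index names and the path name pathTotal are hypothetical, and the two batches are meant as alternatives to compare rather than to run together. Whether either one actually beats the unindexed 11 seconds has to be measured.

```sql
-- Alternative 1: selective XML index that promotes only /payload/total.
-- AS SQL BIGINT matches the type used in Payload.value(..., 'bigint').
CREATE SELECTIVE XML INDEX SXI_DataPoints_Payload
ON dbo.DataPoints (Payload)
FOR
(
    pathTotal = '/payload/total' AS SQL BIGINT SINGLETON
);
GO

-- Alternative 2: primary XML index plus a secondary VALUE index.
CREATE PRIMARY XML INDEX PXML_DataPoints_Payload
ON dbo.DataPoints (Payload);
GO
CREATE XML INDEX IXML_DataPoints_Payload_Value
ON dbo.DataPoints (Payload)
USING XML INDEX PXML_DataPoints_Payload
FOR VALUE;
GO
```

Both kinds of index require a clustered primary key on the table, which DataPoints already has through its ID column.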

Partitioning in SQL Server Standard Edition with billion of rows

Hi, I would like to ask how to partition the following table (see below). The problem I'm having is not in the retrieval of history records, which was resolved by the clustered index. As you can see, the index is based on HistoryParameterID and then the timestamp; this is needed because rows are retrieved by those columns.
The problem is that whenever the table reaches ~1 billion records, inserts slow down. In this scenario about 15k rows/second (this can be 30k to 100k) are inserted, and each row corresponds to a HistoryParameterID.
Basically, HistoryParameterID is not unique; it has a one-to-many relationship with the other columns of the table below.
My hunch is that the index slows down the inserts, because inserts are not always at the bottom since the table is arranged by HistoryParameterID.
I did some testing using the timestamp as the index, but to no avail, since query performance was unacceptable.
Is there any way to partition this by HistoryParameterID? I tried creating 15k tables for a partitioned view, but creating the view never finished executing. Any tips? Or is there any other way to partition? Please note that I'm using Standard Edition, and using Enterprise Edition is not an option.
CREATE TABLE [dbo].[HistorySampleValues]
(
[HistoryParameterID] [int] NOT NULL,
[SourceTimeStamp] [datetime2](7) NOT NULL,
[ArchiveTimestamp] [datetime2](7) NOT NULL CONSTRAINT [DF__HistorySa__Archi__2A164134] DEFAULT (getutcdate()),
[ValueStatus] [int] NOT NULL,
[ArchiveStatus] [int] NOT NULL,
[IntegerValue] [bigint] SPARSE NULL,
[DoubleValue] [float] SPARSE NULL,
[StringValue] [varchar](100) SPARSE NULL,
[EnumNamedSetName] [varchar](100) SPARSE NULL,
[EnumNumericValue] [int] SPARSE NULL,
[EnumTextualValue] [varchar](256) SPARSE NULL
) ON [PRIMARY]
CREATE CLUSTERED INDEX [Source_HistParameterID_Index] ON [dbo].[HistorySampleValues]
(
[HistoryParameterID] ASC,
[SourceTimeStamp] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
I tried creating 15k tables for a partitioned view, but creating the
view never finished executing. Any tips? Or is there any other way to
partition? Please note that I'm using Standard Edition, and using
Enterprise Edition is not an option.
If you go down the partitioned view path (http://technet.microsoft.com/en-us/library/ms190019.aspx), I suggest fewer tables (under one hundred). Without partitioned tables, the optimizer must go through a lot of work since each table of the view could be indexed differently.
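A partitioned view with a small number of member tables could be sketched like this. The member table names, the ID ranges, and the trimmed column list are all hypothetical. The CHECK constraints are what let the optimizer skip member tables that cannot contain the requested HistoryParameterID.

```sql
-- Two hypothetical member tables, each holding one range of IDs.
CREATE TABLE dbo.HistorySampleValues_P1 (
    HistoryParameterID int NOT NULL
        CONSTRAINT CK_HSV_P1 CHECK (HistoryParameterID BETWEEN 1 AND 5000),
    SourceTimeStamp    datetime2(7) NOT NULL,
    DoubleValue        float NULL
);
CREATE CLUSTERED INDEX CIX_HSV_P1
    ON dbo.HistorySampleValues_P1 (HistoryParameterID, SourceTimeStamp);

CREATE TABLE dbo.HistorySampleValues_P2 (
    HistoryParameterID int NOT NULL
        CONSTRAINT CK_HSV_P2 CHECK (HistoryParameterID BETWEEN 5001 AND 10000),
    SourceTimeStamp    datetime2(7) NOT NULL,
    DoubleValue        float NULL
);
CREATE CLUSTERED INDEX CIX_HSV_P2
    ON dbo.HistorySampleValues_P2 (HistoryParameterID, SourceTimeStamp);
GO

-- The view unions the member tables; queries filter on HistoryParameterID.
CREATE VIEW dbo.HistorySampleValues_All
AS
SELECT HistoryParameterID, SourceTimeStamp, DoubleValue
FROM dbo.HistorySampleValues_P1
UNION ALL
SELECT HistoryParameterID, SourceTimeStamp, DoubleValue
FROM dbo.HistorySampleValues_P2;
GO
```

In this sketch the loader inserts directly into the member table that owns the ID range, so the view is used for reads only.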
I would not expect inserts to slow down with table size if HistoryParameterID is incremental. However, in the case of a random value, inserts will become progressively slower as the table size grows due to lower buffer cache efficiency. That problem will exist with a single table, partitioned table, or partitioned view. See http://www.dbdelta.com/improving-uniqueidentifier-performance/ for an example using a guid but the issue applies to any random key value.
You might try a single table with SourceTimeStamp alone as the clustered index key and a non-clustered index on HistoryParameterID and SourceTimeStamp. That would provide the best insert performance, and the non-clustered index (maybe with included columns) might be good enough for your select queries.
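The single-table suggestion could be written roughly as below. The new index names and the choice of included columns are assumptions for illustration, and on a table with a billion rows each of these statements is a long-running offline operation, so this is better tried on a copy first.

```sql
-- Replace the existing clustered index with one keyed on time only,
-- so new rows are appended at the end of the index.
DROP INDEX [Source_HistParameterID_Index] ON dbo.HistorySampleValues;
GO
CREATE CLUSTERED INDEX CIX_HistorySampleValues_SourceTimeStamp
    ON dbo.HistorySampleValues (SourceTimeStamp);
GO
-- Non-clustered index to serve lookups by parameter and time range.
-- The INCLUDE list is a hypothetical example of "included columns".
CREATE NONCLUSTERED INDEX IX_HistorySampleValues_Param_Time
    ON dbo.HistorySampleValues (HistoryParameterID, SourceTimeStamp)
    INCLUDE (ValueStatus, DoubleValue);
GO
```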
Everything you need is here. I hope you can figure it out.
http://msdn.microsoft.com/en-us/library/ms188730.aspx
For Standard Edition, alternative solutions exist, like this answer.
This is an interesting article too.
We also implemented this in our enterprise automation application, with custom indexing around a table of users, and it worked well.
Here are the pros and cons of a custom implementation:
Pros:
Higher performance than a partitioned table, because the application is aware of its own logic.
Cons:
You have to implement a routing method and keep the indexes updated.
The data is no longer centralized.

Sql Server Primary Key With Partition Issue

I am building a table that will be partitioned and contain a FILESTREAM column. The issue I am encountering is that it appears I have to have a composite primary key (FILE_ID and FILE_UPLOADED_DATE) because FILE_UPLOADED_DATE is part of my partition scheme. Is that correct? I would prefer not to have a composite key and simply have FILE_ID as the primary key. Could this just be a user error?
Any suggestions would be appreciated.
Version: SQL Server 2008 R2
Partition Schemes and Function:
CREATE PARTITION FUNCTION DocPartFunction (datetime)
AS RANGE RIGHT FOR VALUES ('20101220')
GO
CREATE PARTITION SCHEME DocPartScheme AS
PARTITION DocPartFunction TO (DATA_FG_20091231, DATA_FG_20101231);
GO
CREATE PARTITION SCHEME DocFSPartScheme AS
PARTITION DocPartFunction TO (FS_FG_20091231,FS_FG_20101231);
GO
Create Statement:
CREATE TABLE [dbo].[FILE](
[FILE_ID] [int] IDENTITY(1,1) NOT NULL,
[DOCUMENT] [varbinary](max) FILESTREAM NULL,
[FILE_UPLOADED_DATE] [datetime] NOT NULL,
[FILE_INT] [int] NOT NULL,
[FILE_EXTENSION] [varchar](10) NULL,
[DocGUID] [uniqueidentifier] ROWGUIDCOL NOT NULL UNIQUE ON [PRIMARY],
CONSTRAINT [PK_File] PRIMARY KEY CLUSTERED
( [FILE_ID] ASC
) ON DocPartScheme ([FILE_UPLOADED_DATE])
)ON DocPartScheme ([FILE_UPLOADED_DATE])
FILESTREAM_ON DocFSPartScheme;
Error if I don't include FILE_UPLOADED_DATE:
Msg 1908, Level 16, State 1, Line 1
Column 'FILE_UPLOADED_DATE' is partitioning column of the index 'PK_File'. Partition columns for a unique index must be a subset of the index key.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.
Thanks!
You are confusing the primary key and the clustered index. There is no reason for the two to be one and the same. You can have a clustered index on FILE_UPLOADED_DATE and a separate, non-clustered, primary key on FILE_ID. In fact you already do something similar for the DocGUID column:
CREATE TABLE [dbo].[FILE](
[FILE_ID] [int] IDENTITY(1,1) NOT NULL,
[DOCUMENT] [varbinary](max) FILESTREAM NULL,
[FILE_UPLOADED_DATE] [datetime] NOT NULL,
[FILE_INT] [int] NOT NULL,
[FILE_EXTENSION] [varchar](10) NULL,
[DocGUID] [uniqueidentifier] ROWGUIDCOL NOT NULL,
constraint UniqueDocGUID UNIQUE NONCLUSTERED ([DocGUID])
ON [PRIMARY])
ON DocPartScheme ([FILE_UPLOADED_DATE])
FILESTREAM_ON DocFSPartScheme;
CREATE CLUSTERED INDEX cdx_File
ON [FILE] (FILE_UPLOADED_DATE)
ON DocPartScheme ([FILE_UPLOADED_DATE])
FILESTREAM_ON DocFSPartScheme;
ALTER TABLE [dbo].[FILE]
ADD CONSTRAINT PK_File PRIMARY KEY NONCLUSTERED (FILE_ID)
ON [PRIMARY];
However such a design will lead to non-aligned indexes which can cause very serious performance problems, and also block all fast partition switch operations. See Special Guidelines for Partitioned Indexes:
Each sort table requires a minimum amount of memory to build. When you
are building a partitioned index that is aligned with its base table,
sort tables are built one at a time, using less memory. However, when
you are building a nonaligned partitioned index, the sort tables are
built at the same time.
As a result, there must be sufficient memory to handle these
concurrent sorts. The larger the number of partitions, the more memory
required. The minimum size for each sort table, for each partition, is
40 pages, with 8 kilobytes per page. For example, a nonaligned
partitioned index with 100 partitions requires sufficient memory to
serially sort 4,000 (40 * 100) pages at the same time. If this memory
is available, the build operation will succeed, but performance may
suffer. If this memory is not available, the build operation will fail.
Your design already has a non-aligned index for DocGUID, so the performance problems are likely already present. If you must keep your indexes aligned then you have to admit one of the side effects of choosing a partition scheme: you can no longer have a logical primary key, nor unique constraints enforcement, unless the key includes the partitioning key.
And finally, one must ask: why use a partitioned table? They are always slower than a non-partitioned alternative. Unless you need fast partition switch operations for ETL (which you are already punting due to the non-aligned index on DocGUID), there is basically no incentive to use a partitioned table. (Preemptive comment: clustered index on the FILE_UPLOADED_DATE is guaranteed a better alternative than 'partition elimination').
The partitioning column must always be present in a partitioned table's clustered index. Any work-around you come up with has to factor this in.
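For comparison, the aligned variant that the error message asks for keeps the asker's original CREATE TABLE and only widens the primary key to include the partitioning column. This is a sketch based on the question's own definitions. The trade-off is that uniqueness is then enforced on the pair of columns, not on FILE_ID alone.

```sql
-- Same table as in the question, with the partitioning column added
-- to the clustered primary key so the index is aligned.
CREATE TABLE [dbo].[FILE](
    [FILE_ID] [int] IDENTITY(1,1) NOT NULL,
    [DOCUMENT] [varbinary](max) FILESTREAM NULL,
    [FILE_UPLOADED_DATE] [datetime] NOT NULL,
    [FILE_INT] [int] NOT NULL,
    [FILE_EXTENSION] [varchar](10) NULL,
    [DocGUID] [uniqueidentifier] ROWGUIDCOL NOT NULL UNIQUE ON [PRIMARY],
    CONSTRAINT [PK_File] PRIMARY KEY CLUSTERED
    ( [FILE_ID] ASC, [FILE_UPLOADED_DATE] ASC
    ) ON DocPartScheme ([FILE_UPLOADED_DATE])
) ON DocPartScheme ([FILE_UPLOADED_DATE])
FILESTREAM_ON DocFSPartScheme;
```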
I know it's an old question, but maybe Google leads someone else here:
A possible solution would be to partition not by the date column but by FILE_ID. Every day / week / month (or whatever time period you use), run an Agent job at midnight that takes the MAX(FILE_ID) where FILE_UPLOADED_DATE < GETDATE(), adds the next filegroup to the partition scheme, and does a split on MaxID + 1.
Of course you will still have the problem of the non-aligned index on DocGUID, unless you either add FILE_ID to this unique index too (which could allow non-unique DocGUIDs) and/or check its uniqueness in an insert/update trigger.
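The Agent job step described above might look like the following sketch. It assumes the table has been re-partitioned on FILE_ID with an int RANGE RIGHT partition function. The names FileIdPartFunction, FileIdPartScheme, FileIdFSPartScheme, and the two filegroups are hypothetical and must exist beforehand.

```sql
-- Sketch of the periodic split. All partition object and filegroup
-- names are hypothetical placeholders.
DECLARE @boundary int;

SELECT @boundary = MAX(FILE_ID) + 1
FROM dbo.[FILE]
WHERE FILE_UPLOADED_DATE < GETDATE();

IF @boundary IS NOT NULL
BEGIN
    -- Both schemes share the partition function, so each needs a
    -- NEXT USED filegroup before the split.
    ALTER PARTITION SCHEME FileIdPartScheme   NEXT USED [DATA_FG_NEXT];
    ALTER PARTITION SCHEME FileIdFSPartScheme NEXT USED [FS_FG_NEXT];

    -- New rows with FILE_ID >= @boundary land in the new partition.
    ALTER PARTITION FUNCTION FileIdPartFunction() SPLIT RANGE (@boundary);
END
```

If no rows were added since the previous run, the boundary already exists and the split raises an error, so a real job would need to check for that first.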

SQL design for various data types

I need to store data in a SQL Server 2008 database from various data sources with different data types. Data types allowed are: Bit, Numeric (1, 2 or 4 bytes), Real and String. There is going to be a value, a timestamp, a FK to the item of which the value belongs and some other information for the data stored.
The most important points are the read performance and the size of the data. There might be a couple thousand items and each item may have millions of values.
I have 5 possible options:
Separate tables for each data type (ValueBit, ValueTinyInt, ValueSmallInt, etc... tables)
Separate tables with inheritance (Value table as base table, ValueBit table just for storing the Bit value, etc...)
Single value table for all data types, with separate fields for each data type (Value table, with ValueBit BIT, ValueTinyInt TINYINT etc...)
Single table and single value field using sql_variant
Single table and single value field using UDT
With case 2, a PK is a must, and,
1000 item * 10 000 000 data each > Int32.Max, and,
1000 item * 10 000 000 data each * 8 byte BigInt PK is huge
Other than that, I am considering 1 or 3 with no PK. Will they differ in size?
I do not have experience with 4 or 5 and I do not think that they will perform well in this scenario.
Which way shall I go?
Your question is hard to answer, as you seem to be using a relational database system for something it is not designed for. The data you want to keep in the database seems too unstructured to get much benefit from a relational database system. Database designs built mostly from fields like "parameter type" and "parameter value", which try to cover very generic situations, are mostly considered bad designs. Maybe you should consider using a "non-relational database" like BigTable. If you really want to use a relational database system, I'd strongly recommend reading Beginning Database Design by Clare Churcher. It's an easy read, but it gets you on the right track with respect to RDBMS.
What are usage scenarios? Start with samples of queries and calculate necessary indexes.
Consider data partitioning as mentioned before. Try to understand your data / relations more. I believe the decision should be based on business meaning/usages of the data.
I think it's a great question - This situation is fairly common, though it is awkward to make tables to support it.
In terms of performance, having a table like the one indicated in #3 potentially wastes a huge amount of storage and RAM, because for each row you allocate space for a value of every type but only use one. If you use the new sparse column feature of SQL Server 2008 it could help, but there are other issues too: it's a little hard to constrain/normalize, because you want only one of the multiple values to be populated for each row. Having two values in two columns would be an error, but the design doesn't reflect that. I'd cross that off.
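If option 3 were kept anyway, the "exactly one value per row" rule could be stated in the schema with a CHECK constraint, combined with SPARSE columns to reduce storage for the NULLs. The table and constraint names below are hypothetical and the column list is only an example.

```sql
-- Sketch: one row per (item, timestamp), exactly one typed value populated.
CREATE TABLE dbo.vals_sparse (
    itemid      int          NOT NULL,
    datestamp   datetime     NOT NULL,
    valueBit    bit          SPARSE NULL,
    valueReal   real         SPARSE NULL,
    valueString varchar(100) SPARSE NULL,
    CONSTRAINT CK_vals_sparse_one_value CHECK (
        CASE WHEN valueBit    IS NULL THEN 0 ELSE 1 END
      + CASE WHEN valueReal   IS NULL THEN 0 ELSE 1 END
      + CASE WHEN valueString IS NULL THEN 0 ELSE 1 END = 1
    )
);
```

This addresses the constraint problem but not the awkwardness of querying several value columns.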
So, if it were me I'd be looking at option 1 or 2 or 4, and the decision would be driven by this: do I typically need to make one query returning rows that have a mix of values of different types in the same result set? Or will I almost always ask for the rows by item and by type? I ask because if the values are different types it implies to me some difference in the source or the use of that data (you are unlikely, for example, to compare a string and a real, or a string and a bit). This is relevant because having different tables per type might actually be a significant performance/scalability advantage, if partitioning the data that way makes queries faster. Partitioning data into smaller sets of more closely related data can give a performance advantage.
It's like having all the data in one massive (albeit sorted) set or having it partitioned into smaller, related sets. The smaller sets favor some types of queries, and if those are the queries you will need, it's a win.
Details:
CREATE TABLE [dbo].[items](
[itemid] [int] IDENTITY(1,1) NOT NULL,
[item] [varchar](100) NOT NULL,
CONSTRAINT [PK_items] PRIMARY KEY CLUSTERED
(
[itemid] ASC
)
)
/* This table has the problem of allowing two values
in the same row, plus allocates but does not use a
lot of space in memory and on disk (bad): */
CREATE TABLE [dbo].[vals](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[valueBit] [bit] NULL,
[valueNumericA] [numeric](2, 0) NULL,
[valueNumericB] [numeric](8, 2) NULL,
[valueReal] [real] NULL,
[valueString] [varchar](100) NULL,
CONSTRAINT [PK_vals] PRIMARY KEY CLUSTERED
(
[itemid] ASC,
[datestamp] ASC
)
)
ALTER TABLE [dbo].[vals] WITH CHECK
ADD CONSTRAINT [FK_vals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[vals] CHECK CONSTRAINT [FK_vals_items]
GO
/* This is probably better, though casting is required
all the time. If you search with the variant as criteria,
that could get dicey as you have to be careful with types,
casting and indexing. Also everything is "mixed" in one
giant set */
CREATE TABLE [dbo].[allvals](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[value] [sql_variant] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[allvals] WITH CHECK
ADD CONSTRAINT [FK_allvals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[allvals] CHECK CONSTRAINT [FK_allvals_items]
GO
/* This would be an alternative, but you trade multiple
queries and joins for the casting issue. OTOH the implied
partitioning might be an advantage */
CREATE TABLE [dbo].[valsBits](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] [bit] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[valsBits] WITH CHECK
ADD CONSTRAINT [FK_valsBits_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[valsBits] CHECK CONSTRAINT [FK_valsBits_items]
GO
CREATE TABLE [dbo].[valsNumericA](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] [numeric](2, 0) NOT NULL
) ON [PRIMARY]
GO
/* ... FK constraint ... */
CREATE TABLE [dbo].[valsNumericB](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] [numeric](8, 2) NOT NULL
) ON [PRIMARY]
GO
/* ... FK constraint ... */
/* etc... */
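To illustrate the casting concern noted for the sql_variant table, a query against dbo.allvals might look like the sketch below. The item id 1 and the choice of real as the type of interest are arbitrary examples. SQL_VARIANT_PROPERTY reports the stored base type, and the CASE guards the CAST so that values of other types are never converted.

```sql
-- Read only the values stored as real for one item, with an explicit cast.
SELECT itemid,
       datestamp,
       CASE WHEN SQL_VARIANT_PROPERTY([value], 'BaseType') = 'real'
            THEN CAST([value] AS real)
       END AS realValue
FROM dbo.allvals
WHERE itemid = 1
  AND SQL_VARIANT_PROPERTY([value], 'BaseType') = 'real'
ORDER BY datestamp;
```

With per-type tables (option 1) the same read is a plain SELECT from the matching table, which is the trade-off the comments above describe.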
