What's the best way to index this table?

I have the following table:
CREATE TABLE [dbo].[HousePrices](
[Id] [int] IDENTITY(1,1) NOT NULL,
[PropertyType] [int] NULL,
[Town] [nvarchar](500) NULL,
[County] [nvarchar](500) NULL,
[Outcode] [nvarchar](10) NULL,
[Price] [int] NULL,
PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
Which currently holds around 20 million records, and I need to run queries to calculate the average price in a certain area. For example:
select avg(price)
from houseprices
where town = 'London'
and propertytype = 1
The WHERE clause could have any combination of Town, County or Outcode, and will probably always have PropertyType (which is one of four values). I've tried creating a non-clustered index on one of the fields, but that still took around 2 minutes to run.
Surely this should be able to run in under a second?

It depends.
If your WHERE clause only returns a small subset of the records, then create an index for each combination of search columns, e.g. one multi-column index on PropertyType, Town, County, Outcode, another on PropertyType, County, Outcode, etc. You can skip any index whose key columns are a prefix of an existing index (i.e. if you have an index on A, B, C, D, you don't need one on A, B, C; however, you do need A, C, D if B can be omitted).
You can reduce the number of required indexes by reducing the number of combinations: for example, you could make County mandatory when searching by Town -- which would make sense, since averaging over Vienna (Austria) and Vienna (Virginia) together would be quite useless.
If your WHERE clause returns a large set of records, your query will take a lot of time anyway, since all the selected records need to be fetched from disk or cache to calculate the average. In that case, you can improve performance by adding the Price column to your indexes as an included column. This means the query only has to read the index rather than the actual rows.
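As a rough sketch (index names assumed), two of those covering indexes might look like this:
CREATE NONCLUSTERED INDEX IX_HousePrices_Type_Town_County_Outcode
ON [dbo].[HousePrices] ([PropertyType], [Town], [County], [Outcode])
INCLUDE ([Price]);

CREATE NONCLUSTERED INDEX IX_HousePrices_Type_County_Outcode
ON [dbo].[HousePrices] ([PropertyType], [County], [Outcode])
INCLUDE ([Price]);
With Price included, the AVG(price) query above becomes an index seek plus a stream aggregate and never has to touch the base table.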

Related

Index and primary key in large table that doesn't have an Id column

I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance mostly on selecting data, but also in inserts.
CREATE TABLE [dbo].[IndicatorValue]
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT NULL,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch inserted between 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (I didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I've read that it's always good to have a PK on every table.
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
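For reference, a rough sketch of the arrangement described above (the index name is assumed):
-- Clustered index whose key order matches the WHERE clause
-- (IndicatorId, Interval) and the ORDER BY (UnixTime)
CREATE CLUSTERED INDEX IX_IndicatorValue_Indicator_Interval_Time
ON [dbo].[IndicatorValue] ([IndicatorId], [Interval], [UnixTime]);
If a primary key is wanted as well, it could be declared as a nonclustered constraint on the same three columns, provided that combination is unique, at the cost of storing a second copy of those keys.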

Will a table with an int primary key outperform its uid equivalent?

I am working with a legacy SQL Server database which uses UNIQUEIDENTIFIER and am considering performance. If I have two tables, identical except for the identity column, something like this:
CREATE TABLE [uidExampleTable] (
[exampleUid] UNIQUEIDENTIFIER CONSTRAINT [DF_uidExampleTable_uid] DEFAULT (newid()) NOT NULL,
[name] VARCHAR (50) NOT NULL,
[createdDate] DATETIME NOT NULL,
CONSTRAINT [PK_uidExampleTable] PRIMARY KEY CLUSTERED ([exampleUid] ASC));
CREATE TABLE [intExampleTable] (
[exampleIntId] INT IDENTITY (1, 1) NOT NULL,
[name] VARCHAR (50) NOT NULL,
[createdDate] DATETIME NOT NULL,
CONSTRAINT [PK_intExampleTable] PRIMARY KEY CLUSTERED ([exampleIntId] ASC));
And I fill these tables with, say, ten million rows each, then perform a select on each:
Select top 20 * from uidExampleTable order by createdDate desc
Select top 20 * from intExampleTable order by createdDate desc
Would you expect the second query on intExampleTable to return results more quickly?
Both tables have an index. Whether or not there is an index on the table is determined by the PRIMARY KEY directive, rather than the type of the key field.
However, these indexes won't help those queries for either table.
There are still some performance differences, though. The UNIQUEIDENTIFIER (hereafter UID, because I'm lazy) adds an extra 12 bytes for each row. Assuming the average name length is 10 characters out of a possible 50, that should work out to 38 bytes per row* on average for the int table and 50 bytes per row on average for the UID table, which is more than a 30% increase in row size.
So yes, that can make a difference over 10 million records. Keep in mind, though, for many tables you'll have a lot more data in the table, and the relative difference starts to diminish as the width of the table increases.
The other place you'll have a performance difference is INSERT statements. With an IDENTITY column, an INSERT is naturally already in primary key order and new records are simply appended to the end of the last page (or the beginning of a new page, if the last page was full). A UID, though, is essentially random, so you usually need to insert into the middle of a page somewhere. You can offset this a bit by changing the FILL FACTOR for your index, but that comes at the cost of needing more pages. This is also one reason we have sequential UIDs.
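Both mitigations are easy to apply to the example table above; the fill factor value here is arbitrary:
-- Swap the NEWID() default for NEWSEQUENTIALID(), which generates
-- sequential GUIDs so new rows tend to land at the end of the clustered index
ALTER TABLE [uidExampleTable] DROP CONSTRAINT [DF_uidExampleTable_uid];
ALTER TABLE [uidExampleTable]
ADD CONSTRAINT [DF_uidExampleTable_uid]
DEFAULT (NEWSEQUENTIALID()) FOR [exampleUid];

-- Alternatively, leave free space in each page for the random inserts,
-- at the cost of more pages overall
ALTER INDEX [PK_uidExampleTable] ON [uidExampleTable]
REBUILD WITH (FILLFACTOR = 80);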
Even so, these differences tend to be small compared to other factors. Sometimes they can be important, but you generally need to measure your system's performance first to know for sure.
For example, for this query, rather than worrying about UID vs INT for the key, you can really improve things by adding a descending index on the createdDate column. Of course, if you know you could have more than 4 billion rows, or if it would be dangerous for someone to guess an ID and retrieve a valid record, don't let a little bit of performance outweigh those concerns.
* 14 bytes row overhead + 4 bytes int ID + 2 bytes varchar overhead + 10 bytes varchar data + 8 bytes datetime = 38 bytes total
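A sketch of that createdDate index (name assumed):
CREATE NONCLUSTERED INDEX IX_uidExampleTable_createdDate
ON [uidExampleTable] ([createdDate] DESC);
With it in place, the TOP 20 ... ORDER BY createdDate DESC query can read the newest 20 index entries (plus key lookups for the remaining columns) instead of sorting the whole table.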
Yes, it will, except when you have thousands of inserts per second and your storage cannot handle it; then you will get contention on the writes.

Indexing columns in SQL Server

I have the following table
CREATE TABLE [dbo].[ActiveHistory]
(
[ID] [INT] IDENTITY(1,1) NOT NULL,
[Date] [VARCHAR](250) NOT NULL,
[ActiveID] [INT] NOT NULL,
[UserID] [INT] NOT NULL,
CONSTRAINT [PK_ActiveHistory]
PRIMARY KEY CLUSTERED ([ID] ASC)
)
About 600,000 rows are inserted into the table per day, which means roughly 300,000 distinct ActiveIDs per date and about 500 distinct users. I would like to keep about five years of history in one table, which means more than a billion rows, with roughly 4,000 distinct UserIDs and 1,000,000 distinct ActiveIDs over the five years. It is very important for me that queries against this table are fast.
Most queries in the past joined on Date and UserID, but lately I often have to include ActiveID as well, and sometimes only two of the three columns are used (any pair).
I never use ID in joins.
Currently I have a nonclustered index with UserID and Date as key columns and ID and ActiveID as included columns. My question is: how do I best arrange the indexes for this table given these new requirements? Simply adding an index for every combination may take a huge amount of space, and an application that uses the same server already suffers when CPU usage goes to 99%, so I am not sure how new indexes will affect that.
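For reference, the existing index as described would look something like this (index name assumed; ID is the clustered key, so listing it in INCLUDE is redundant but harmless):
CREATE NONCLUSTERED INDEX IX_ActiveHistory_UserID_Date
ON [dbo].[ActiveHistory] ([UserID], [Date])
INCLUDE ([ID], [ActiveID]);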

Recreate index on column store indexed table with 35 billion rows

I have a big table for which I need to rebuild the index. The table has a Clustered Columnstore Index (CCI), and we realized we need to sort the data according to a specific use case.
Users perform date-range and equality queries, but because the data is not sorted the way they want to read it back, the queries are not optimal. The SQL Advisory Team recommended organizing the data into the right rowgroups so queries can benefit from rowgroup elimination.
Table Description:
Partitioned by Timestamp1 (monthly partition function)
Total rows: 31 billion
Estimated row size: 60 bytes
Estimated table size: 600 GB
Table Definition:
CREATE TABLE [dbo].[Table1](
[PkId] [int] NOT NULL,
[FKId1] [smallint] NOT NULL,
[FKId2] [int] NOT NULL,
[FKId3] [int] NOT NULL,
[FKId4] [int] NOT NULL,
[Timestamp1] [datetime2](0) NOT NULL,
[Measurement1] [real] NULL,
[Measurement2] [real] NULL,
[Measurement3] [real] NULL,
[Measurement4] [real] NULL,
[Measurement5] [real] NULL,
[Timestamp2] [datetime2](3) NULL,
[TimeZoneOffset] [tinyint] NULL
)
CREATE CLUSTERED COLUMNSTORE INDEX [Table1_ColumnStoreIndex] ON [dbo].[Table1] WITH (DROP_EXISTING = OFF)
GO
Environment:
SQL Server 2014 Enterprise Ed.
8 Cores, 32 GB RAM
VMware High Performance Platform
My strategy is:
Drop the existing CCI
Create ordinary Clustered Row Index with the right columns, this will sort the data
Recreate the CCI WITH (DROP_EXISTING = ON). This will convert the existing clustered row index (CRI) into a CCI.
My questions are:
Does it make sense to rebuild the index, or should I just reload the data? Reloading may take a month to complete, whereas rebuilding the index may take just as long, maybe...
If I drop the existing CCI, will the table expand, since it will no longer be compressed?
31 billion rows is roughly 30,000 full rowgroups (at a maximum of 1,048,576 rows each); a rowgroup is just another form of horizontal partitioning, so it really matters when and how you load your data. Note that SQL Server 2014 supports only offline columnstore index builds.
There are a few pros and cons to weigh when comparing creating the index vs. reloading the data:
Creating the index is a single operation, so if it fails at any point you lose your progress. I would not recommend it at your data size.
An index build will create primary dictionaries, so it is beneficial for low-cardinality, dictionary-encoded columns.
A bulk load won't create primary dictionaries, but you can reload the data if some of your batches fail for any reason.
Both the index build and a bulk load will run in parallel if you give them enough resources, which means the ordering from the base clustered index won't be perfectly preserved. This is just something to be aware of; at your scale it won't matter if you have a few overlapping rowgroups.
If your data undergoes updates/deletes and you reorganize (from SQL Server 2019 onwards the Tuple Mover will also do this), your ordering might degrade over time.
I would create an ordered Clustered Index partitioned on the date column, so that you have anything between 50 and 200 rowgroups per partition (do some experiments). Then you can create a partition-aligned Clustered Columnstore Index and switch in one partition at a time; the partition switch will trigger an index build, so you'll get the benefit of primary dictionaries, and if you end up with updates/deletes on a partition you can fix up the index quality by rebuilding that partition rather than the whole table. If you decide to use reorganize, you still maintain some level of ordering, because rowgroups will only be merged within the same partition.
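A minimal sketch of the conversion mechanics, following the three steps from the question with DROP_EXISTING = ON (index and partition scheme names are assumed; the partition-by-partition switch approach described above would wrap the same steps around a staging table per partition):
-- 1. Drop the existing CCI and sort the data with an ordinary
--    clustered index, aligned to the monthly partition scheme
DROP INDEX [Table1_ColumnStoreIndex] ON [dbo].[Table1];

CREATE CLUSTERED INDEX [CI_Table1_Timestamp1]
ON [dbo].[Table1] ([Timestamp1])
ON [psMonthly]([Timestamp1]);  -- assumed partition scheme name

-- 2. Convert the sorted rowstore index into a clustered columnstore index;
--    DROP_EXISTING = ON replaces the index of the same name in one statement
CREATE CLUSTERED COLUMNSTORE INDEX [CI_Table1_Timestamp1]
ON [dbo].[Table1]
WITH (DROP_EXISTING = ON)
ON [psMonthly]([Timestamp1]);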

SQL design for various data types

I need to store data in a SQL Server 2008 database from various data sources with different data types. The data types allowed are: Bit, Numeric (1, 2 or 4 bytes), Real and String. Each stored value will have a timestamp, a FK to the item to which the value belongs, and some other information.
The most important points are the read performance and the size of the data. There might be a couple thousand items and each item may have millions of values.
I have 5 possible options:
Separate tables for each data type (ValueBit, ValueTinyInt, ValueSmallInt, etc... tables)
Separate tables with inheritance (Value table as base table, ValueBit table just for storing the Bit value, etc...)
Single value table for all data types, with separate fields for each data type (Value table, with ValueBit BIT, ValueTinyInt TINYINT etc...)
Single table and single value field using sql_variant
Single table and single value field using UDT
With option 2, a PK is a must, and:
1,000 items * 10,000,000 values each > Int32.Max, and
1,000 items * 10,000,000 values each * an 8-byte BIGINT PK is huge (roughly 80 GB for the key column alone).
Other than that, I am considering option 1 or 3 with no PK. Will they differ in size?
I do not have experience with 4 or 5 and I do not think that they will perform well in this scenario.
Which way shall I go?
Your question is hard to answer because you seem to be using a relational database system for something it is not designed for. The data you want to keep in the database seems too unstructured to get much benefit from a relational database system. Database designs consisting mostly of fields like "parameter type" and "parameter value", which try to cover very generic situations, are generally considered bad designs. Maybe you should consider using a non-relational database like BigTable. If you really want to use a relational database system, I'd strongly recommend reading Beginning Database Design by Clare Churcher. It's an easy read, but it gets you on the right track with respect to RDBMSs.
What are the usage scenarios? Start with sample queries and work out the necessary indexes.
Consider data partitioning as mentioned before. Try to understand your data and its relations better; I believe the decision should be based on the business meaning and usage of the data.
I think it's a great question - this situation is fairly common, though it is awkward to design tables to support it.
In terms of performance, having a table like the one indicated in #3 potentially wastes a huge amount of storage and RAM, because for each row you allocate space for a value of every type but only use one. The new sparse column feature of 2008 could help, but there are other issues too: it's a little hard to constrain/normalize, because you want only one of the multiple value columns to be populated for each row - having values in two columns would be an error, but the design doesn't reflect that. I'd cross that one off.
So, if it were me, I'd be looking at option 1, 2 or 4, and the decision would be driven by this: do I typically need to make one query returning rows that have a mix of values of different types in the same result set, or will I almost always ask for the rows by item and by type? I ask because if the values are of different types, it implies to me some difference in the source or the use of that data (you are unlikely, for example, to compare a string and a real, or a string and a bit). This is relevant because having different tables per type might actually be a significant performance/scalability advantage, if partitioning the data that way makes queries faster. Partitioning data into smaller sets of more closely related data can give a performance advantage.
It's like having all the data in one massive (albeit sorted) set or having it partitioned into smaller, related sets. The smaller sets favor some types of queries, and if those are the queries you will need, it's a win.
Details:
CREATE TABLE [dbo].[items](
[itemid] [int] IDENTITY(1,1) NOT NULL,
[item] [varchar](100) NOT NULL,
CONSTRAINT [PK_items] PRIMARY KEY CLUSTERED
(
[itemid] ASC
)
)
/* This table has the problem of allowing two values
in the same row, plus it allocates (but does not use)
a lot of space in memory and on disk (bad): */
CREATE TABLE [dbo].[vals](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[valueBit] [bit] NULL,
[valueNumericA] [numeric](2, 0) NULL,
[valueNumericB] [numeric](8, 2) NULL,
[valueReal] [real] NULL,
[valueString] [varchar](100) NULL,
CONSTRAINT [PK_vals] PRIMARY KEY CLUSTERED
(
[itemid] ASC,
[datestamp] ASC
)
)
ALTER TABLE [dbo].[vals] WITH CHECK
ADD CONSTRAINT [FK_vals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[vals] CHECK CONSTRAINT [FK_vals_items]
GO
/* This is probably better, though casting is required
all the time. If you search with the variant as criteria,
that could get dicey as you have to be careful with types,
casting and indexing. Also everything is "mixed" in one
giant set */
CREATE TABLE [dbo].[allvals](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[value] [sql_variant] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[allvals] WITH CHECK
ADD CONSTRAINT [FK_allvals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[allvals] CHECK CONSTRAINT [FK_allvals_items]
GO
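As a quick illustration of that casting issue (the values and item ids here are hypothetical, and matching rows are assumed to exist in dbo.items):
-- Mixed base types go into the same sql_variant column...
INSERT INTO [dbo].[allvals] ([itemid], [datestamp], [value])
VALUES (1, GETDATE(), CAST(3.14 AS real)),
       (2, GETDATE(), CAST('hello' AS varchar(100)));

-- ...but filtering or aggregating by the underlying type means casting,
-- both the type check and the value itself
SELECT [itemid], CAST([value] AS real) AS realValue
FROM [dbo].[allvals]
WHERE CAST(SQL_VARIANT_PROPERTY([value], 'BaseType') AS sysname) = N'real';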
/* This would be an alternative, but you trade multiple
queries and joins for the casting issue. OTOH the implied
partitioning might be an advantage */
CREATE TABLE [dbo].[valsBits](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] [bit] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[valsBits] WITH CHECK
ADD CONSTRAINT [FK_valsBits_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[valsBits] CHECK CONSTRAINT [FK_valsBits_items]
GO
CREATE TABLE [dbo].[valsNumericA](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] numeric( 2, 0 ) NOT NULL
) ON [PRIMARY]
GO
... FK constraint ...
CREATE TABLE [dbo].[valsNumericB](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] numeric ( 8, 2 ) NOT NULL
) ON [PRIMARY]
GO
... FK constraint ...
etc...
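And, for comparison, a sketch of the "multiple queries" trade-off with the per-type tables: assembling a mixed result set for one item means widening every value to a common display type (the itemid used here is arbitrary):
SELECT [datestamp], CAST([val] AS varchar(100)) AS [val], 'bit' AS [valType]
FROM [dbo].[valsBits] WHERE [itemid] = 1
UNION ALL
SELECT [datestamp], CAST([val] AS varchar(100)), 'numericA'
FROM [dbo].[valsNumericA] WHERE [itemid] = 1
UNION ALL
SELECT [datestamp], CAST([val] AS varchar(100)), 'numericB'
FROM [dbo].[valsNumericB] WHERE [itemid] = 1
ORDER BY [datestamp];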
