Deciding on a clustered index in Microsoft SQL Server

We are creating a “clients” table that will have around 50 million records.
I am having a hard time deciding on the clustered index.
Theory says it should be unique, narrow, static, and ever-increasing… but in practice it should be the key you use to refer to your records most often.
The table has 50 columns…
Per the first approach the CI should be:
[Client_id] [bigint] IDENTITY(1,1) NOT NULL,
But I feel tempted to use:
[SF_id] [varchar](18) NOT NULL,
or
[UpdateDate] [datetime] NOT NULL,
or
[SystemModStamp] [datetime] NOT NULL,
Reality is that I do not know exactly how the end users will query the table, but I know they will use SF_id quite often and I know they will rarely use Client_id. I also know that I myself will use UpdateDate or SystemModStamp (not sure yet) as the key for the ‘delta’ daily merges that I will set up in a job/stored procedure.
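For what it's worth, a common compromise that serves both patterns (a sketch only; the column list is abbreviated, the index names are made up, and the UNIQUE assumption on SF_id may not hold) is to keep the narrow, ever-increasing identity as the clustered primary key and give the lookup and delta columns nonclustered indexes:
-- Sketch: clustered key stays unique/narrow/static/ever-increasing,
-- while the SF_id lookups and the delta merges each get their own index.
CREATE TABLE [dbo].[clients](
[Client_id] [bigint] IDENTITY(1,1) NOT NULL,
[SF_id] [varchar](18) NOT NULL,
[UpdateDate] [datetime] NOT NULL,
[SystemModStamp] [datetime] NOT NULL,
-- ... remaining columns ...
CONSTRAINT [PK_clients] PRIMARY KEY CLUSTERED ([Client_id]))
GO
-- Drop UNIQUE if SF_id can repeat; this serves the end users' lookups.
CREATE UNIQUE NONCLUSTERED INDEX [IX_clients_SF_id] ON [dbo].[clients]([SF_id]);
GO
-- Serves the daily delta merge; swap in UpdateDate if that ends up being the key.
CREATE NONCLUSTERED INDEX [IX_clients_SystemModStamp] ON [dbo].[clients]([SystemModStamp]);
GO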

Related

Strategy to keep updated summary tables standing by (SQL Server)

I've got a client portal project (the first one I've developed, so basic best practice is what I'm looking for here, nothing fancy) nearing its first release.
A simplification of the main record types used in reporting is the following:
CREATE TABLE [dbo].[conversions](
[conversion_id] [nvarchar](128) primary key NOT NULL,
[click_id] [int] NULL,
[conversion_date] [datetime] NOT NULL,
[last_updated] [datetime] NULL,
[click_date] [datetime] NULL,
[affiliate_affiliate_id] [int] NOT NULL,
[advertiser_advertiser_id] [int] NOT NULL,
[offer_offer_id] [int] NOT NULL,
[creative_creative_id] [int] NOT NULL,
[conversion_type] [nvarchar](max) NULL)
CREATE TABLE [dbo].[clicks](
[click_id] [int] primary key NOT NULL,
[click_date] [datetime] NOT NULL,
[affiliate_affiliate_id] [int] NOT NULL,
[advertiser_advertiser_id] [int] NOT NULL,
[offer_offer_id] [int] NOT NULL,
[campaign_id] [int] NOT NULL,
[creative_creative_id] [int] NOT NULL,
[ip_address] [nvarchar](max) NULL,
[user_agent] [nvarchar](max) NULL,
[referrer_url] [nvarchar](max) NULL,
[region_region_code] [nvarchar](max) NULL,
[total_clicks] [int] NOT NULL)
My specific question is: given millions of rows in each table, what mechanism is used to serve up summary reports quickly on demand given you know all the possible reports that can be requested?
As a starting point, performance-wise, raw queries against 18 months' worth of data for the busiest client yield 3 to 5 seconds of latency on my dashboard, and the worst case is upwards of 10 seconds for a summary report with a custom date range spanning all the rows.
I know I can cache them after the first hit, but I want snappy performance on the first hit.
My feeling is that this is a fundamental aspect of an application of this nature and that there are tons of applications like this out there, so is there an already well-thought-out method of pre-calculating tables that have already done the grouping and aggregation? And how do you keep them up to date? Do you use SQL Agent and custom console apps that brute-force the calculations beforehand?
Any general pointers would be much appreciated.
Both tables are time series. They seem to be clustered by an ID column, which has little value for how time series are queried. Time series are almost always queried by date range, so your clustered organization should serve that type of query first and foremost: cluster by date, and move the ID primary key constraint into a nonclustered one.
CREATE TABLE [dbo].[conversions](
[conversion_id] [nvarchar](128) NOT NULL,
[conversion_date] [datetime] NOT NULL,
...
constraint pk_conversions primary key nonclustered ([conversion_id]))
go
create clustered index [cdx_conversions] on [dbo].[conversions]([conversion_date]);
go
CREATE TABLE [dbo].[clicks](
[click_id] [int] NOT NULL,
[click_date] [datetime] NOT NULL,
...
constraint [pk_clicks] primary key nonclustered ([click_id]));
go
create clustered index [cdx_clicks] on [dbo].[clicks]([click_date]);
This model will serve the typical queries that filter by a range on [click_date] and on [conversion_date]. For any other query the answer will be very specific to your query.
There are limits to how useful a relational, row-organized model can be for an OLAP/DW workload like yours. Specialized tools do a better job at it. Columnstore indexes can deliver amazingly fast responses, but they are difficult to update. Building a MOLAP cube can also deliver blazing results, but that is a serious undertaking. There are even specialized time-series databases out there.
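If you want to experiment with the columnstore route, a sketch against the conversions table above might look like the following (assumes SQL Server 2012 or later; the index name and column choice are made up):
-- Sketch: a nonclustered columnstore index over the reporting columns.
-- Prior to SQL Server 2016 this makes the table read-only while the index
-- exists, which is the "difficult to update" caveat mentioned above.
CREATE NONCLUSTERED COLUMNSTORE INDEX [ncci_conversions]
ON [dbo].[conversions]
([conversion_date], [affiliate_affiliate_id], [advertiser_advertiser_id],
[offer_offer_id], [creative_creative_id]);
GO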

Identity column without index or unique constraint

I just experienced a database breakdown due to sudden, extraordinary data loading from disk.
I found that the issue would arise when I attempted to insert into a log table with approx. 3.5 million rows. The table features an ID column set to IDENTITY, but has no indexes or unique constraints.
CREATE TABLE [dbo].[IntegrationTestLog](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Ident] [varchar](50) NULL,
[Date] [datetime] NOT NULL,
[Thread] [varchar](255) NOT NULL,
[Level] [varchar](50) NOT NULL,
[Logger] [varchar](255) NOT NULL,
[Message] [varchar](max) NOT NULL,
[Exception] [varchar](max) NULL
)
Issue triggered by this line:
INSERT INTO IntegrationTestLog ([Ident],[Date],[Thread],[Level],[Logger],[Message],[Exception]) VALUES (@Ident, @log_date, @thread, @log_level, @logger, @message, @exception)
There are possibly many other queries that will trigger it, but this one I know for sure.
Bear with me, because I'm only guessing now, but does the identity seeding process somehow slow down if an index is missing? Could it by any slight chance fall back to doing a MAX(ID) query to get the latest entry? (Probably not.) I haven't succeeded in finding any deep technical information about the subject yet. Please share if you know of any literature or links on the subject.
To solve the issue, we ended up truncating the table, which itself took VERY long. I also promoted ID to be primary key.
Then I read this article: Identity columns and found that truncate actually does touch the identity seed.
A truncate table (but not delete) will update the current seed to the
original seed value.
...which again only led me to be more suspicious of the identity seed.
Again I'm searching in the dark - please enlighten me on this issue if you have the insight.
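For reference, the primary-key promotion mentioned above can be done in place (a sketch only; the constraint name is made up):
-- Sketch: add a clustered primary key on the existing identity column.
-- On a few million rows this rebuilds the heap into a clustered index,
-- so expect some blocking and log growth while it runs.
ALTER TABLE [dbo].[IntegrationTestLog]
ADD CONSTRAINT [PK_IntegrationTestLog] PRIMARY KEY CLUSTERED ([Id]);
GO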

SQL design for various data types

I need to store data in a SQL Server 2008 database from various data sources with different data types. The data types allowed are: Bit, Numeric (1, 2 or 4 bytes), Real and String. There is going to be a value, a timestamp, an FK to the item to which the value belongs, and some other information for the data stored.
The most important points are the read performance and the size of the data. There might be a couple thousand items and each item may have millions of values.
I have 5 possible options:
Separate tables for each data type (ValueBit, ValueTinyInt, ValueSmallInt, etc... tables)
Separate tables with inheritance (Value table as base table, ValueBit table just for storing the Bit value, etc...)
Single value table for all data types, with separate fields for each data type (Value table, with ValueBit BIT, ValueTinyInt TINYINT etc...)
Single table and single value field using sql_variant
Single table and single value field using UDT
With case 2, a PK is a must, and:
1,000 items × 10,000,000 values each > Int32.Max, and
1,000 items × 10,000,000 values each × an 8-byte bigint PK is huge.
Other than that, I am considering 1 or 3 with no PK. Will they differ in size?
I do not have experience with 4 or 5 and I do not think that they will perform well in this scenario.
Which way shall I go?
Your question is hard to answer, as you seem to be using a relational database system for something it is not designed for. The data you want to keep in the database seems too unstructured to get much benefit from a relational database system. Database designs that rely mostly on fields like "parameter type" and "parameter value" to cover very generic situations are generally considered bad designs. Maybe you should consider using a "non-relational database" like BigTable. If you really want to use a relational database system, I'd strongly recommend reading Beginning Database Design by Clare Churcher. It's an easy read, but it gets you on the right track with respect to RDBMS design.
What are the usage scenarios? Start with sample queries and work out the necessary indexes.
Consider data partitioning, as mentioned before. Try to understand your data and relations better. I believe the decision should be based on the business meaning/usage of the data.
I think it's a great question - This situation is fairly common, though it is awkward to make tables to support it.
In terms of performance, having a table like the one indicated in #3 potentially wastes a huge amount of storage and RAM, because for each row you allocate space for a value of every type but only use one. The new sparse column feature of 2008 could help, but there are other issues too: it's a little hard to constrain/normalize, because you want only one of the multiple values to be populated for each row; having values in two columns would be an error, but the design doesn't reflect that (a CHECK-constraint sketch of what enforcing this takes follows the example DDL below). I'd cross that off.
So, if it were me, I'd be looking at option 1, 2 or 4, and the decision would be driven by this: do I typically need to make one query returning rows that have a mix of values of different types in the same result set? Or will I almost always ask for the rows by item and by type? I ask because if the values are of different types, that implies to me some difference in the source or the use of that data (you are unlikely, for example, to compare a string and a real, or a string and a bit). This is relevant because having different tables per type might actually be a significant performance/scalability advantage, if partitioning the data that way makes queries faster. Partitioning data into smaller sets of more closely related data can give a performance advantage.
It's like having all the data in one massive (albeit sorted) set or having it partitioned into smaller, related sets. The smaller sets favor some types of queries, and if those are the queries you will need, it's a win.
Details:
CREATE TABLE [dbo].[items](
[itemid] [int] IDENTITY(1,1) NOT NULL,
[item] [varchar](100) NOT NULL,
CONSTRAINT [PK_items] PRIMARY KEY CLUSTERED
(
[itemid] ASC
)
)
/* This table has the problem of allowing two values
in the same row, plus allocates but does not use a
lot of space in memory and on disk (bad): */
CREATE TABLE [dbo].[vals](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[valueBit] [bit] NULL,
[valueNumericA] [numeric](2, 0) NULL,
[valueNumericB] [numeric](8, 2) NULL,
[valueReal] [real] NULL,
[valueString] [varchar](100) NULL,
CONSTRAINT [PK_vals] PRIMARY KEY CLUSTERED
(
[itemid] ASC,
[datestamp] ASC
)
)
ALTER TABLE [dbo].[vals] WITH CHECK
ADD CONSTRAINT [FK_vals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[vals] CHECK CONSTRAINT [FK_vals_items]
GO
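As an aside, the "only one value populated per row" rule mentioned above could be enforced with a CHECK constraint (a sketch only; the constraint name is made up, and it is exactly the kind of extra plumbing this design forces on you):
/* Sketch: require exactly one of the value columns to be non-NULL. */
ALTER TABLE [dbo].[vals] WITH CHECK
ADD CONSTRAINT [CK_vals_one_value] CHECK (
(CASE WHEN [valueBit] IS NOT NULL THEN 1 ELSE 0 END) +
(CASE WHEN [valueNumericA] IS NOT NULL THEN 1 ELSE 0 END) +
(CASE WHEN [valueNumericB] IS NOT NULL THEN 1 ELSE 0 END) +
(CASE WHEN [valueReal] IS NOT NULL THEN 1 ELSE 0 END) +
(CASE WHEN [valueString] IS NOT NULL THEN 1 ELSE 0 END) = 1)
GO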
/* This is probably better, though casting is required
all the time. If you search with the variant as criteria,
that could get dicey as you have to be careful with types,
casting and indexing. Also everything is "mixed" in one
giant set */
CREATE TABLE [dbo].[allvals](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[value] [sql_variant] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[allvals] WITH CHECK
ADD CONSTRAINT [FK_allvals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[allvals] CHECK CONSTRAINT [FK_allvals_items]
GO
/* This would be an alternative, but you trade multiple
queries and joins for the casting issue. OTOH the implied
partitioning might be an advantage */
CREATE TABLE [dbo].[valsBits](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] [bit] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[valsBits] WITH CHECK
ADD CONSTRAINT [FK_valsBits_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO
ALTER TABLE [dbo].[valsBits] CHECK CONSTRAINT [FK_valsBits_items]
GO
CREATE TABLE [dbo].[valsNumericA](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] numeric( 2, 0 ) NOT NULL
) ON [PRIMARY]
GO
... FK constraint ...
CREATE TABLE [dbo].[valsNumericB](
[itemid] [int] NOT NULL,
[datestamp] [datetime] NOT NULL,
[val] numeric ( 8, 2 ) NOT NULL
) ON [PRIMARY]
GO
... FK constraint ...
etc...
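And if, with the per-type tables, you occasionally do need all values in one result set, a UNION ALL view can stitch them back together (a sketch; the view name is made up, only the typed tables shown above are included, and the CAST to sql_variant is the price of mixing types in one column):
GO
CREATE VIEW [dbo].[allTypedVals] AS
SELECT [itemid], [datestamp], CAST([val] AS sql_variant) AS [val] FROM [dbo].[valsBits]
UNION ALL
SELECT [itemid], [datestamp], CAST([val] AS sql_variant) AS [val] FROM [dbo].[valsNumericA]
UNION ALL
SELECT [itemid], [datestamp], CAST([val] AS sql_variant) AS [val] FROM [dbo].[valsNumericB]
-- ... UNION ALL the remaining typed tables ...
GO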

DB advice needed for performance of a 'SessionVisit' table

I have a 'SessionVisit' table which collects data about user visits.
The CREATE statement for this table is below; there may be 25,000 rows added a day.
My database knowledge is definitely not up to scratch when it comes to understanding the implications of such a schema.
Can anyone give me their 2c of advice on some of these issues:
Do I need to worry about row size for this schema in SQL Server 2008? I'm not even sure how the 8 KB row size works in 2008, or whether I'm wasting a lot of space if I'm not using all 8 KB.
How should I purge old records I don't want? Will new rows fill in the empty space from dropped rows? (See the batched-delete sketch after the answers below.)
Any advice on indexes
I know this is quite general in nature. Any 'obvious' or non-obvious info would be appreciated.
Here's the table :
USE [MyDatabase]
GO
/****** Object: Table [dbo].[SessionVisit] Script Date: 06/06/2009 16:55:05 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[SessionVisit](
[SessionGUID] [uniqueidentifier] NOT NULL,
[SessionVisitId] [int] IDENTITY(1,1) NOT NULL,
[timestamp] [timestamp] NOT NULL,
[SessionDate] [datetime] NOT NULL CONSTRAINT [DF_SessionVisit_SessionDate] DEFAULT (getdate()),
[UserGUID] [uniqueidentifier] NOT NULL,
[CumulativeVisitCount] [int] NOT NULL CONSTRAINT [DF_SessionVisit_CumulativeVisitCount] DEFAULT ((0)),
[SiteUserId] [int] NULL,
[FullEntryURL] [varchar](255) NULL,
[SiteCanonicalURL] [varchar](100) NULL,
[StoreCanonicalURL] [varchar](100) NULL,
[CampaignId] [int] NULL,
[CampaignKey] [varchar](50) NULL,
[AdKeyword] [varchar](50) NULL,
[PartnerABVersion] [varchar](10) NULL,
[ABVersion] [varchar](10) NULL,
[UserAgent] [varchar](255) NULL,
[Referer] [varchar](255) NULL,
[KnownRefererId] [int] NULL,
[HostAddress] [varchar](20) NULL,
[HostName] [varchar](100) NULL,
[Language] [varchar](50) NULL,
[SessionLog] [xml] NULL,
[OrderDate] [datetime] NULL,
[OrderId] [varchar](50) NULL,
[utmcc] [varchar](1024) NULL,
[TestSession] [bit] NOT NULL CONSTRAINT [DF_SessionVisit_TestSession] DEFAULT ((0)),
[Bot] [bit] NULL,
CONSTRAINT [PK_SessionVisit] PRIMARY KEY CLUSTERED
(
[SessionGUID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [dbo].[SessionVisit] WITH CHECK ADD CONSTRAINT [FK_SessionVisit_KnownReferer] FOREIGN KEY([KnownRefererId])
REFERENCES [dbo].[KnownReferer] ([KnownRefererId])
GO
ALTER TABLE [dbo].[SessionVisit] CHECK CONSTRAINT [FK_SessionVisit_KnownReferer]
GO
ALTER TABLE [dbo].[SessionVisit] WITH CHECK ADD CONSTRAINT [FK_SessionVisit_SiteUser] FOREIGN KEY([SiteUserId])
REFERENCES [dbo].[SiteUser] ([SiteUserId])
GO
ALTER TABLE [dbo].[SessionVisit] CHECK CONSTRAINT [FK_SessionVisit_SiteUser]
I see SessionGUID and SessionVisitId; why have both a uniqueidentifier and an IDENTITY(1,1) on the same table? Seems redundant to me.
I see referer and knownrefererid; think about getting the referer from the knownrefererid if possible. This will help reduce excess writes.
I see campaignkey and campaignid; again, get the key from the campaigns table if possible.
I see orderid and orderdate. I'm sure you can get the order date from the orders table, correct?
I see hostaddress and hostname, do you really need the name? Usually the hostname doesn't serve much purpose and can be easily misleading.
I see multiple dates and timestamps, is any of this duplicate?
How about that SessionLog column? I see that it's XML. Is it a lot of data, is it data you may already have in other columns? If so get rid of the XML or the duplicated columns. Using SQL 2008 you can parse data out of that XML column when reporting and possibly eliminate a few extra columns (thus writes). Are you going to be in trouble in the future when developers add more to that XML? XML to me just screams 'a lot of excessive writing'.
Mitch says to remove the primary key. Personally, I would leave the clustered index on the table. A clustered index can help speed up write times because the DB writes new rows at the end of the table on disk, though that only holds if the clustering key is ever-increasing, which a random GUID is not.
Strip out some of this duplicate information and you'll probably do just fine writing a row each visit.
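If you do keep both columns, one arrangement that preserves that append-at-the-end behaviour (a sketch only, not from the answers above; it assumes no other tables reference SessionGUID via foreign keys) is to cluster on the identity and keep the GUID unique but nonclustered:
-- Sketch: cluster on the ever-increasing identity, keep the GUID unique.
ALTER TABLE [dbo].[SessionVisit] DROP CONSTRAINT [PK_SessionVisit];
GO
ALTER TABLE [dbo].[SessionVisit]
ADD CONSTRAINT [PK_SessionVisit] PRIMARY KEY CLUSTERED ([SessionVisitId]);
GO
ALTER TABLE [dbo].[SessionVisit]
ADD CONSTRAINT [UQ_SessionVisit_SessionGUID] UNIQUE NONCLUSTERED ([SessionGUID]);
GO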
Well, I'd recommend NOT inserting a few k of data with EVERY page!
First thing I'd do would be to see how much of this information I could get from a 3rd party analytics tool, perhaps combined with log analysis. That should allow you to drop a lot of the fields.
25k inserts a day isn't much, but the catch here is that the busier your site gets, the more load this is going to put on the DB. Perhaps you could build a queuing system that batches the writes, but really, most of this information is already in the logs.
Agree with Chris that you would probably be better off using log analysis (check out Microsoft's free Log Parser).
Failing that, I would remove the Foreign Key constraints from your SessionVisit table.
You mentioned row size; the varchars in your table do not pre-allocate to their maximum length (more like 4 + 4 bytes for an empty field, approx.). But saying that, a general rule is to keep rows as 'lean' as possible.
Also, I would remove the primary key from the SessionGUID (GUID) column. It won't help you much.
That's also an awful lot of nulls in that table. I think you should group together the columns that must be non-null at the same time. In fact, you should do a better analysis of the data you're writing, rather than lumping it all together in a single table.
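On the purge question above, a batched, date-based delete is the usual pattern (a sketch only; the 90-day cutoff and batch size are made up). The freed space generally stays allocated to the table and is reused for new rows rather than being returned to the OS.
-- Sketch: purge SessionVisit rows older than a (made-up) 90-day cutoff
-- in small batches so each delete stays lock- and log-friendly.
DECLARE @cutoff datetime = DATEADD(DAY, -90, GETDATE());
WHILE 1 = 1
BEGIN
DELETE TOP (5000) FROM [dbo].[SessionVisit]
WHERE [SessionDate] < @cutoff;
IF @@ROWCOUNT = 0 BREAK;
END
GO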

SQL Server index advice for performance

I'm looking for some advice on how to get the indexes working better for this query...
SQL Server 2005/2008 (some customers have 2005, some 2008)...
SELECT sales.ChainStoreId,
sales.CashBoxId,
dbo.DateOnly2(sales.BonDate),
MAX(sales.BonDate),
SUM(sales.SumPrice)
FROM [BACK_CDM_CLEAN_BOLTEN].[dbo].[CashBoxSales] sales
WHERE sales.BonType in ('B','P','W')
AND Del = 0
AND sales.BonDate >= @minDate
GROUP BY sales.ChainStoreId,
sales.CashBoxId,
dbo.DateOnly2(sales.BonDate)
Table looks like the following
CREATE TABLE [dbo].[CashBoxSales](
[SalesRowId] [int] IDENTITY(1,1) NOT NULL,
[ChainStoreId] [int] NOT NULL,
[CashBoxId] [int] NOT NULL,
[BonType] [char](1) NOT NULL,
[BonDate] [datetime] NOT NULL,
[BonNr] [nvarchar](20) NULL,
[SumPrice] [money] NOT NULL,
[Discount] [money] NOT NULL,
[EmployeeId] [int] NULL,
[DayOfValidity] [datetime] NOT NULL,
[ProcStatus] [int] NOT NULL,
[Del] [int] NOT NULL,
[InsertedDate] [datetime] NOT NULL,
[LastUpdate] [datetime] NOT NULL)
What would be the correct ordering of the index columns, and should it be a covering or composite index, etc.?
The table has up to 10 million rows. There are other similar selects, but I'm hoping that with the advice from getting this one up to speed (it's the most important) I can tweak a few others.
Many thanks!
When you have your query in SQL Server Management Studio, just select "Analyze Query in Database Tuning Advisor" from the context menu, and off you go!
Mind you: this only tweaks this one single query in isolation! Adding indices here to speed this one query up might adversely affect other parts of your application. An index always comes with overhead - inserts and deletes tend to be slower.
Also, don't blindly implement all the recommendations of the DTA - use your own judgment as to whether an index makes sense or not.
And lastly: measure, measure, measure! Measure your performance before any changes as a baseline, then measure again and again after you've made the changes and compare.
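A simple way to get that baseline (a sketch; wrap it around the SELECT from the question and note the logical reads and CPU/elapsed times it prints):
-- Sketch: compare these numbers before and after any index change.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ... run the SELECT from the question here ...
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;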
My best advice is to capture this query with SQL Profiler and feed the trace to the Database Engine Tuning Advisor. It will recommend some indexes for you to try.
Also, you might try setting up a partitioned table and use one of your GROUP BY columns as the partitioning key.
Off the top of my head I would start with
INDEX (BonType, Del, BonDate)
Or even just
INDEX (BonType, BonDate)
I would recommend using an index analyzer, Profiler, and benchmarking various combinations.
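For what the suggestions above would look like as an actual covering index (a sketch; the index name is made up and the INCLUDE list just mirrors the columns the query touches):
-- Sketch: equality-ish predicates first (BonType, Del), then the range
-- predicate (BonDate); INCLUDE the remaining referenced columns so the
-- index covers the query and avoids key lookups.
CREATE NONCLUSTERED INDEX [IX_CashBoxSales_BonType_Del_BonDate]
ON [dbo].[CashBoxSales] ([BonType], [Del], [BonDate])
INCLUDE ([ChainStoreId], [CashBoxId], [SumPrice]);
GO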
